Finding HTML tags with Regex
introduction
I’ve recently been working on syntax highlighting and building a content pipeline for a website. In both cases, I found myself relying heavily on regex, and in particular on patterns with lazy repetition. This might reflect my own laziness, but it raised a question:
Can we improve performance by using precise character classes, and avoiding laziness?
So, I ran a small benchmark, investigating the actual impact of these
choices. In addition, we compare Python's built-in
re module to the popular and feature-rich
regex package, as well as the impact of the
DOTALL flag.
For this experiment, we'll focus on capturing HTML tags, and for simplicity:
-
Let's capture the tags as a whole, including their attributes (for
example
<p class="myclass">is a valid match). -
Let's not worry about which tags are opening or closing (for example
<p>and</p>are two valid matches) -
Let's not worry about validation, for example
<12%?hello>will also be considered a match, even though it's not valid HTML.
To try our patterns, we can use an HTML snippet like this:
<div>Hello world</div>
<section id = "main" class="container" >
<p style="
color: red;" >
This is fine
</p>
<img src = "cat.jpg" alt=" a cat " />
</section>
<em
>this is weird</em
>
It is valid HTML, but with somewhat suspicious formatting:
- The
sectionhas some extra spaces. - The
phas spaces and even a new-line inside. -
The
emhas newlines both in the opening and closing tags.
The idea is that our patterns should be robust enough to catch all these (9) tags.
Method & patterns
In this experiment, let's try regex patterns on the following form:
- opening
< -
some character class repeated with
+?(lazy) or+(non-lazy) - closing
>
patterns = {
# 1. Class of "whitespace" and "non-whitespace"
"lazy_noflag": re.compile(r"<[\s\S]+?>"),
# 2. Dot needs flag to include newlines
"lazy_dotall": re.compile(r"<.+?>", re.DOTALL),
# 3. Negated character class
"non_lazy": re.compile(r"<[^>]+>"),
}
To learn more about the these types of patterns, and the potential problems with laziness, the repetitions chapter at Regular-Expressions.info provides some useful details and examples.
Benchmark & results
The following results are from Python 3.13.3, using both the
built-in re module, and the regex module
(2024.11.6). We measure 1000 runs of the
findall() operation, and report the average and standard
deviation over 50 repetitions.
In addition, we can use the regex101 to debug our patterns. For example it counts the number of steps taken to process the test string.
| Pattern | Time (re) |
Time (regex) |
# Steps |
|---|---|---|---|
"lazy_noflag" <[\s\S]+?> |
3.06 ms (std: 0.09) | 4.92 ms (std: 0.20) | 161 |
"lazy_dotall" <.+?> |
2.45 ms (std: 0.08) | 2.89 ms (std: 0.15) | 161 |
"non_lazy" <[^>]+> |
1.18 ms (std: 0.06) | 2.15 ms (std: 0.07) | 36 |
Key takeaways:
-
The "non_lazy" pattern (
<[^>]+>) is the clear winner, being around 60% faster than "lazy_noflag" (<[\s\S]+?>) for both regex implementations. - The built-in re module also outperforms the third-party regex here, especially for the "non_lazy" pattern where it's ca 45% faster. This is likely because its simpler engine has less overhead when advanced features aren’t needed.
Why?
On a pattern such as <.+?> the regex engine may
backtrack. After the < character, the
engine lets .+? match one character, then checks if the
next character is >. If not, it goes back and extends
the match by one more character, and tries again, repeating until it
finds a >. This “try, rewind, try again” process,
commonly referred to as backtracking, is time-consuming.
The pattern <[^>]+> avoids backtracking. The
character class [^>] means “match any character except
>,” so the engine can scan forward in a single pass
until the first > is reached.
What about an HTML parser
To me, one of the most common problems with regex is edge cases. So for ultimate robustness, an established HTML parser might actually be the tool of choice. Again, relying on the Python standard library we can use this thing:
class SimpleParser(HTMLParser):
"""Simplified HTML parser. only gets start and end tags"""
def __init__(self) -> None:
super().__init__()
self.tags = []
def handle_starttag(self, tag, attrs):
self.tags.append(tag)
# we could reconstruct the attr string like:
# + " " + " ".join(f'{k}="{v}"' for k, v in attrs))
# but we can avoid that for a more fair performance comparison
def handle_endtag(self, tag):
self.tags.append(f"/{tag}")
def handle_startendtag(self, tag, attrs):
# to avoid catching <img/> twice
if tag[0] != "/":
self.tags.append(tag)
Then, we can process the text, extracting all tags as follows:
p = SimpleParser()
p.feed(text)
p.close()
print(p.tags)
Note that the results here are slightly different; we ignored attributes and brackets, and unlike regex we don't get the exact positions in the input string.
| Time | |
|---|---|
| HTML parser | 26.22 ms (std: 0.49) |
Conclusion
So, it turns out regex can work well for finding HTML tags and is generally faster than using a full HTML parser. However, the choices made in the pattern, as well as the implementation used, can have a significant impact on runtime.