Finding HTML tags with Regex

introduction

I’ve recently been working on syntax highlighting and building a content pipeline for a website. In both cases, I found myself relying heavily on regex, and in particular on patterns with lazy repetition. This might reflect my own laziness, but it raised a question:

Can we improve performance by using precise character classes, and avoiding laziness?

So, I ran a small benchmark, investigating the actual impact of these choices. In addition, we compare Python's built-in re module to the popular and feature-rich regex package, as well as the impact of the DOTALL flag.

For this experiment, we'll focus on capturing HTML tags, and for simplicity:

Let's capture the tags as a whole, including their attributes (for example <p class="myclass"> is a valid match).
Let's not worry about which tags are opening or closing (for example <p> and </p> are two valid matches)
Let's not worry about validation, for example <12%?hello> will also be considered a match, even though it's not valid HTML.

To try our patterns, we can use an HTML snippet like this:

<div>Hello world</div>

<section id = "main"   class="container" >
   <p style="
        color: red;" >
    This is fine
   </p>
   <img src = "cat.jpg" alt=" a cat " />
</section>

<em
    >this is weird</em
>

It is valid HTML, but with somewhat suspicious formatting:

The section has some extra spaces.
The p has spaces and even a new-line inside.
The em has newlines both in the opening and closing tags.

The idea is that our patterns should be robust enough to catch all these (9) tags.

Method & patterns

In this experiment, let's try regex patterns on the following form:

opening <
some character class repeated with +? (lazy) or + (non-lazy)
closing >

patterns = {
    # 1. Class of "whitespace" and "non-whitespace"
    "lazy_noflag": re.compile(r"<[\s\S]+?>"),
    # 2. Dot needs flag to include newlines
    "lazy_dotall": re.compile(r"<.+?>", re.DOTALL),
    # 3. Negated character class 
    "non_lazy": re.compile(r"<[^>]+>"),
}

To learn more about the these types of patterns, and the potential problems with laziness, the repetitions chapter at Regular-Expressions.info provides some useful details and examples.

Benchmark & results

The following results are from Python 3.13.3, using both the built-in re module, and the regex module (2024.11.6). We measure 1000 runs of the findall() operation, and report the average and standard deviation over 50 repetitions.

In addition, we can use the regex101 to debug our patterns. For example it counts the number of steps taken to process the test string.

Pattern	Time (`re`)	Time (`regex`)	# Steps
"lazy_noflag" `<[\s\S]+?>`	3.06 ms (std: 0.09)	4.92 ms (std: 0.20)	161
"lazy_dotall" `<.+?>`	2.45 ms (std: 0.08)	2.89 ms (std: 0.15)	161
"non_lazy" `<[^>]+>`	1.18 ms (std: 0.06)	2.15 ms (std: 0.07)	36

Key takeaways:

The "non_lazy" pattern (<[^>]+>) is the clear winner, being around 60% faster than "lazy_noflag" (<[\s\S]+?>) for both regex implementations.
The built-in re module also outperforms the third-party regex here, especially for the "non_lazy" pattern where it's ca 45% faster. This is likely because its simpler engine has less overhead when advanced features aren’t needed.

Why?

On a pattern such as <.+?> the regex engine may backtrack. After the < character, the engine lets .+? match one character, then checks if the next character is >. If not, it goes back and extends the match by one more character, and tries again, repeating until it finds a >. This “try, rewind, try again” process, commonly referred to as backtracking, is time-consuming.

The pattern <[^>]+> avoids backtracking. The character class [^>] means “match any character except >,” so the engine can scan forward in a single pass until the first > is reached.

What about an HTML parser

To me, one of the most common problems with regex is edge cases. So for ultimate robustness, an established HTML parser might actually be the tool of choice. Again, relying on the Python standard library we can use this thing:

class SimpleParser(HTMLParser):
    """Simplified HTML parser. only gets start and end tags"""

    def __init__(self) -> None:
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        # we could reconstruct the attr string like:
        # + " " + " ".join(f'{k}="{v}"' for k, v in attrs))
        # but we can avoid that for a more fair performance comparison

    def handle_endtag(self, tag):
        self.tags.append(f"/{tag}")

    def handle_startendtag(self, tag, attrs):
        # to avoid catching <img/> twice
        if tag[0] != "/":
            self.tags.append(tag)

Then, we can process the text, extracting all tags as follows:

p = SimpleParser()
p.feed(text)
p.close()
print(p.tags)

Note that the results here are slightly different; we ignored attributes and brackets, and unlike regex we don't get the exact positions in the input string.

	Time
HTML parser	26.22 ms (std: 0.49)

Conclusion

So, it turns out regex can work well for finding HTML tags and is generally faster than using a full HTML parser. However, the choices made in the pattern, as well as the implementation used, can have a significant impact on runtime.