Safe Regex – REDoS protection

Regular expressions, or regex, are patterns used commonly for string matching. Although there are several flavors for the regex syntax, the basic building blocks of the pattern language stay roughly the same. While the use of regex for string matching is wide spread, it’s somewhat complicated syntax earned it several jokes, as in this xkcd, or in a comment I recently heard saying that “regex are write-only”, and are tough to read.

What most people don’t know is that the use of regex patterns, that are usually deployed on large input strings, can be sometimes vulnerable to regex Denial of Service (REDoS). A detailed example can be found in the postmortem report about the Stack Exchange outage from last July.

During my B.s.c at TAU (2010) I took a workshop project aiming to mitigate this threat, as part of a project suggested by a major software company. The goal of the project was to extend apache’s regular expression engine, then a part of the jakarta project, to support Thompson’s NFA. In the remaining of this blog post I will explain the cause of the REDoS vulnerability, Thompson’s solution, and our implementation along with key notes about it.

Regular Expressions 101:

The basic primitives of the regex syntax are (there are more primitives):

  • (dog)? – match at most one (single or none) occurrences of “dog”
  • dog | cat – match the string “dog” or the string “cat”
  • (dog)* – match any (none or more) occurrence of the “dog” string
  • (dog)+ – match at least one (one or more)  occurrences of the “dog” string

In addition, the matching is greedy by nature, and so defines operator precedence. For example, the regex “( a | b ) * ( b ) *” with the input “bbbbb” will match all of the b’s to the 1st operator, leaving the 2nd operator with no matched input. Adding the “?” after a given operator will make him “reluctant”, i.e. not greedy.

State machine and REDoS:

Since regular expressions defines a pattern to be used for string matching, their backend engine is commonly implemented using 2 main phases:

  1. Compilation – converting the pattern into a state machine
  2. Matching – running an interpreter on the input and the state machine

We will define the following annotations:

  • n – number of character in the input
  • m – number of states in the state machine
  • a{k} – k times the string “a”

We have seen that the regex operators can define multiple choices, and so the ideal state machine will be a Non-Deterministic Finite Automata (NFA). However, most common implementations defines only a “logical” NFA, while simulating it with a deterministic interpreter, with some rules defining it’s branch decisions, and backtracking.

To sum it up, since the default regex operators are greedy, the interpreter will first try to match the greedy option of the interpreter, and only after a failure it will backtrack to try the reluctant option. For example:

  • pattern: a?{n}a{n}
  • input: a{n} (n times the string “a”)

The interpreter will try to match each “a” character to to the matching “?” operator, thus making the suffix string (of length n) unmatched. Once failing, the interpreter will backtrack it’s last decision and “give up” the character to match it by the suffix. Since it still won’t suffice, he will again backtrack, to the 2nd last decision, and will retry to greedily match the next “a” character by the operator… Only on it’s last attempt to reluctantly “give up” all of the split decisions, he will finally be able to match the input against the pattern.

Although the state machine’s size is only m=O(n) the interpreter’s running time will be exponential, traversing a complete binary decision tree of depth n. This backtracking behavior makes the running time possibly exponential, thus creating a DoS vulnerability while the program is busy matching an adversarial input string.

Thompson’s NFA:

In an article published in the late 60’s, Ken Thompson suggested an alternative approach for regex matching. Instead of using only a partial state machine, and running a state-full interpreter on it, he suggested to run a state-less interpreter on a complete NFA. For instance, the “?” operator can be described as:

or_operator_nfa

One option can try and take the string, and the second option will continue on.

Now that a full state machine has been constructed, Thompson’s defines that the interpreter will simply traverse the state machine using the following rules:

  • The interpreter will have a list of occupied states, since it can be simultaneously in more than one state
  • On every decision arrow the interpreter will split and take all matching splits
  • The interpreter will start by moving from the initial state to all states reachable without any input character
  • On every input character, the interpreter will advance all of it’s occupied states through the state machine
  • The input string will be considered a “match” if at the end the ending state will be one of the occupied states

This algorithm uses the fact that although there might be exponential many ways to reach a given state, once the interpreter arrived to it, the actual path taken is irrelevant. This makes the algorithm polynomial, since there is an easy upper bound of O(nm):

  • The interpreter can occupy at most m states on each iteration
  • There are n+1 iterations

Our implementation – Theory vs Reality:

In our implementation, my project partner, Tomer Levy, and I have started by converting Jakarta’s logical state machine into a complete state machine. Although there were some technical details required, adding a backward split edge and supporting loops for instance, the implementation was quite straightforward. Here is an example of the node representation to the “” operator, that was previously only described as a single “” node:

safere_state_machine

An important issue regarding regex string matching, is that the pattern can define clauses, and it is important to retrieve what strings were matched by each clause. Returning to the original example, the 1st clause () matched all of the input: “bbbbb”, while the 2nd clause matched the empty string: “”. This means that we must know what path has the interpreter taken when reaching a given state, otherwise we won’t be able to track back the clause’s content. This somewhat contradicted Thompson’s original algorithm, making a gap between the theoretical algorithm and the functionality we must support in our implementation.

We managed to overcome this problem by supplying each occupied state with a “heritage”, a trace of what has it matched in all previous clauses. This will be useful since the operator precedence of the regex pattern defines a test to decide which of several given state machine paths is considered “better”. And now instead of only occupying a state, which is a binary status of occupied / empty, we let each candidate path fight it’s way into the state. The most suitable path, the one with the better clause matches (heritage), will be the one in control of the occupied state, thus staying in the original algorithmic bounds of one remembered path per state per iteration.

This solution makes use of the fact that at any decision to split one way or the other, can instantly be ranked as “preferable” or “worthless”. This means that there is no danger that a fight over a single state will eliminate a candidate path that later will be considered better.

Functionality note:

We have managed to support all of the known regex syntax available in other libraries, except for the backreference syntax. This syntax declares that “(cat | dog) \1” will only be able to match “catcat” and “dogdog” and it is known to be NP-Hard to support.

Conclusion:

I recommend anyone interested to read more about our project, and our implementation, to take a look at the workshop’s results, which can be found here.

Overall, for any practical implementations, it probably preferable to first find out if  backreferences are needed or not, and then to check if the wanted regex library supports a “safe” implementation that avoids the risk of the REDoS vulnerability.

For more reading about this topic, and for some useful regex tools, here are some handy links:

Author: eyalitkin

White hat security researcher. Recently finished my M.s.c at TAU, and now focus on security research, mainly in open sources.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: