ReDoS Vulnerability: JavaScript Vs PHP Regex Explained
Hey guys! Ever stumbled upon a piece of code that behaves differently in different environments and left you scratching your head? Today, we're diving into a fascinating case: a regular expression (regex) for email validation that's vulnerable to ReDoS (Regular Expression Denial of Service) in JavaScript but not in PHP. Let's break this down, make it super clear, and see what's going on under the hood.
Understanding the Regex
Before we jump into the vulnerability, let's get comfy with the regex itself. The beast in question, taken from a Stack Overflow answer on PHP email validation (with the ^ and $ anchors removed), looks like this:
/(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){64,}@)/
Woah, that's a mouthful! Don't worry, we'll dissect it. This fragment is the part of the email validator that enforces length limits, and it does so in a rather complex and, as we'll see, problematic way. It counts "units", where a unit is either a single character or a backslash-escaped character, optionally wrapped in double quotes. The first lookahead rejects input that contains 255 or more units, and the second rejects input that has 64 or more units before the @ symbol.
Three building blocks do the work. Negative lookaheads (?!...) assert that a pattern does not match at the current position, which is how the regex disqualifies strings that exceed the length limits. Non-capturing groups (?:...) group the alternatives without saving the matched text, a common optimization that does nothing to reduce the pattern's complexity. The character class [\x00-\x7E] matches any ASCII character from 0x00 through 0x7E (everything except DEL), which restricts what may follow a backslash. None of this prevents the vulnerability, because the core issue is structural: an alternation with optional quotes on both sides sits inside a repetition, so the same input can often be matched in many different ways, and a backtracking engine may try all of them before a lookahead can fail. Understanding each part of the regex is therefore just the first step. The real key is how the regex engine processes this pattern and how certain inputs can cause it to go into overdrive.
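To make both points concrete, here is a minimal JavaScript sketch. It is an illustration under stated assumptions, not code from the original Stack Overflow answer: the sticky y flag is added so the pattern is only tried at index 0, and the names LENGTH_GUARD, passesAtStart, itemLengthsAt, and countParses are hypothetical helpers made up for this example. The first half probes the two length limits with quote-free input (which stays fast), and the second half counts, without running the regex at all, how many ways an input can be split into units.

```javascript
// The pattern from the article plus the sticky flag (y), so a match is only
// attempted at lastIndex. The y flag is an addition made for this sketch.
const LENGTH_GUARD = /(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){64,}@)/y;

// Hypothetical helper: do both negative lookaheads succeed at index 0?
// Only quote-free inputs are passed in below, so there is no ambiguity to explore.
function passesAtStart(input) {
  LENGTH_GUARD.lastIndex = 0;
  return LENGTH_GUARD.test(input);
}

console.log(passesAtStart("a".repeat(254)));                  // true
console.log(passesAtStart("a".repeat(255)));                  // false
console.log(passesAtStart("a".repeat(63) + "@example.com"));  // true
console.log(passesAtStart("a".repeat(64) + "@example.com"));  // false

// Hypothetical helper: every length a single unit can have when it starts at
// `start`. A unit is  "?\\[\x00-\x7E]"?  or  "?[^\\"]"?  (the two alternatives).
function itemLengthsAt(input, start) {
  const lengths = [];
  for (const lead of [0, 1]) {
    let pos = start;
    if (lead === 1) {
      if (input[pos] !== '"') continue; // optional leading quote not present
      pos += 1;
    }
    const bodyEnds = [];
    // Alternative 1: a backslash followed by a character in \x00-\x7E.
    if (input[pos] === "\\" && pos + 1 < input.length && input.charCodeAt(pos + 1) <= 0x7e) {
      bodyEnds.push(pos + 2);
    }
    // Alternative 2: one character that is neither a backslash nor a quote.
    if (pos < input.length && input[pos] !== "\\" && input[pos] !== '"') {
      bodyEnds.push(pos + 1);
    }
    for (const end of bodyEnds) {
      lengths.push(end - start);                              // no trailing quote
      if (input[end] === '"') lengths.push(end + 1 - start);  // with trailing quote
    }
  }
  return lengths;
}

// Count the distinct ways the whole input can be split into consecutive units.
function countParses(input) {
  const ways = new Array(input.length + 1).fill(0);
  ways[0] = 1;
  for (let i = 0; i < input.length; i++) {
    if (ways[i] === 0) continue;
    for (const len of itemLengthsAt(input, i)) ways[i + len] += ways[i];
  }
  return ways[input.length];
}

console.log(countParses("abc"));            // 1
console.log(countParses('a"'.repeat(3)));   // 4
console.log(countParses('a"'.repeat(10)));  // 512
```

With quote-free input there is exactly one way to split the string, so the lookaheads fail or succeed quickly. Once quotes appear between characters, each quote can close one unit or open the next, and every extra a" pair doubles the count. A backtracking engine that has to rule out every split before a lookahead can fail may end up doing work that grows the same way, which is why this sketch counts the splits instead of feeding such input to the regex.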
Breaking Down the Beast
Let's simplify it a bit:
(?!...): This is a negative lookahead. It checks if a pattern doesn't match at the current position.
(?:...): This is a non-capturing group. It groups parts of the regex without saving the matched text.
\x22: This matches a double quote (`