← All issues

[YARR] Add auto-possession optimization

2a8223d

Source/JavaScriptCore/yarr/YarrPattern.cpp

+ void optimizePossessiveQuantifiers()
+ {
+ auto followerForcesPossessive = [&](const PatternTerm& greedy, const PatternTerm& next) -> bool {
+ if (next.type != PatternTerm::Type::PatternCharacter)
+ return false;
+ if (next.quantityType != QuantifierType::FixedCount || next.quantityMinCount < 1)
+ return false;
+ char32_t fc = next.patternCharacter;
+ if (termMatchesCharacter(greedy, fc) != TriState::False)
+ return false;
+ if (next.ignoreCase()) {
+ if (!isASCII(fc))
+ return false;
+ if (termMatchesCharacter(greedy, toASCIIUpper(fc)) != TriState::False)
+ return false;
+ if (termMatchesCharacter(greedy, toASCIILower(fc)) != TriState::False)
+ return false;
+ }
+ return true;
+ };

Source/JavaScriptCore/yarr/YarrJIT.cpp

+ if (term->possessive()) {
+ loadFromFrame(term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex(), countRegister);
+ if (m_decodeSurrogatePairs && !U_IS_BMP(term->patternCharacter))
+ m_jit.lshift32(MacroAssembler::TrustedImm32(1), countRegister);
+ m_jit.sub32(countRegister, m_regs.index);
+ m_backtrackingState.fallthrough();
+ return;
+ }

YARR (Yet Another Regex Runtime) is WebKit's regex engine. It runs patterns through a parser into a YarrPattern IR (composed of PatternTerm nodes), then either interprets them or compiles them to native code via YarrJIT. Possessive quantifiers match maximally and never give back characters to the following term — unlike greedy quantifiers, which do. Auto-possessification is a static analysis that proves a greedy quantifier can be treated as possessive: if the characters the greedy term can match are fully disjoint from the characters the mandatory following term requires, then every backtrack iteration would fail anyway, so the give-back loop can be omitted.

This commit adds that analysis as optimizePossessiveQuantifiers and the matching JIT code path. The analysis pass inspects compiled PatternTerms and marks greedy single-character terms as possessive when the mandatory following term's character set is provably disjoint, accounting for case folding when the follower is /i. YarrJIT then emits code that skips the backtrack loop for such terms, eliminating futile give-back iterations for patterns like /a+b/ or /[0-9a-f]{1,4}:/.

The correctness of the disjointness analysis is load-bearing: a false positive — concluding two character sets are disjoint when they share even one code point — causes the JIT to silently skip backtracking and produce a wrong non-match, exactly the class of bug that breaks regex-based security validators. The analysis must handle Unicode case folding, non-BMP surrogate pairs, and inverted/variable-width character classes correctly — all of which are subtle and worth attention.

🔒

Several edge cases in the disjointness analysis and JIT give-back paths are worth security investigation.

Subscribe to read more