Class RubyFlavor
java.lang.Object
com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavor
com.oracle.truffle.regex.tregex.parser.flavors.RubyFlavor
An implementation of the Ruby regex flavor.
This implementation supports all Ruby regular expressions with the exception of the following features:
- \G escape sequence: In Ruby regular expressions, \G can be used to assert that we are still at the initial index. TRegex only provides limited support for this feature by handling cases when \G appears at the beginning of all top-level alternatives.
- \K keep command: This command can be used in Ruby regular expressions to modify the matcher's state so that it deletes any characters matched so far and considers the current position as the start of the reported match. There is no operator like this in ECMAScript that would allow one to tinker with the matcher's state.
- Unicode character properties not supported by ECMAScript and not covered by the POSIX character classes: Ruby regular expressions use the syntax \p{...} for Unicode character properties. Similar to ECMAScript, they offer access to Unicode Scripts, General Categories and some character properties. Ruby also allows access to character properties that refer to POSIX character classes (e.g. \p{Alnum} for [[:alnum:]]). We support all of the above, including any character properties specified by Ruby's documentation. However, Ruby regular expressions still have access to extra Unicode character properties (e.g. Age) that we do not support. We could dive through Ruby's implementation to find out which other properties might be used and try providing them too.
- recursive \g<...> subexpression calls and \k<...+-x> backreferences to other levels
- \X extended grapheme cluster escapes: This escape sequence expands into a complex expression that references a lot of Unicode character sets that we currently do not have support for in TRegex.
- (?~...) absent expressions: These constructs can be used in Ruby regular expressions to match strings that do not contain a match for a given expression. TRegex doesn't have support for this kind of construction.
However, there are subtle differences in how some fundamental constructs behave in ECMAScript regular expressions and Ruby regular expressions. This concerns core concepts like loops and capture groups and their interactions. These issues cannot be handled by transpiling alone and they require extra care on the side of TRegex. The issues and the solutions are listed below.
- backreferences to unmatched capture groups should fail: In ECMAScript, when a backreference
is made to a capture group which hasn't been matched, such a backreference is ignored and
matching proceeds. If this happens in Ruby, the backreference will fail to match and the search
will stop and backtrack.
Node.js (ECMAScript): > /(?:(a)|(b))\1/.exec("b") [ 'b', undefined, 'b', index: 0, input: 'b', groups: undefined ] MRI (Ruby): irb(main):001:0> /(?:(a)|(b))\1/.match("b") => nilThis is solved inTRegexBacktrackingNFAExecutorNode, by the introduction of thebackrefWithNullTargetSucceedsfield, which controls how backreferences to unmatched capture groups are resolved. Also, inJSRegexParser, an optimization that drops forward references and nested references from ECMAScript regular expressions is turned off for Ruby regular expressions. - re-entering a loop should not reset enclosed capture groups: In ECMAScript, when a group is
re-entered while looping, all of the capture groups contained within the looping group are reset.
On the other hand, in Ruby, their contents are preserved from one iteration of the loop to the
next. As we see in the example below, ECMAScript drops the contents of the
(a)capture group, while Ruby keeps it.Node.js (ECMAScript): > /((a)|(b))+/.exec("ab") [ 'ab', 'b', undefined, 'b', index: 0, input: 'ab', groups: undefined ] MRI (Ruby): irb(main):001:0> /((a)|(b))+/.match("ab") => #<MatchData "ab" 1:"b" 2:"a" 3:"b">This is solved inNFATraversalRegexASTVisitor. The methodgetGroupBoundariesis modified so that the instructions for clearing enclosed capture groups are omitted from generated NFA transitions when processing Ruby regular expressions. - loops should be repeated as long as the state of capture groups evolves: In ECMAScript, when
a loop matches the minimum required number of iterations, any further iterations are only matched
provided they consume some characters from the input. This is a measure intended to stop infinite
loops once they no longer consume any input. Ruby has a similar guard, but it admits extra
iterations if they either consume characters or change the state of capture groups. Thus it is
possible to have extra iterations that don't consume any characters but that store empty strings
as matches of capture groups. In the example below, ECMAScript executes the outer
?loop zero times, since executing it once would consume no characters. As a result, the contents of capture group 1 are null. On the other hand, Ruby executes the loop once, because the execution modifies the contents of capture group 1 so that it contains the empty string.Node.js (ECMAScript): > /(a*)? /.exec("") [ '', undefined, index: 0, input: '', groups: undefined ] MRI (Ruby): irb(main):001:1" /(a*)?/.match("") => #<MatchData "" 1:"">This is solved by permitting one extra empty iteration of a loop when traversing the AST and generating the NFA. In the absence of backreferences, an extra empty iteration is sufficient, because any other iteration on top of that will retread the same path and have no further effects. With backreferences (or more specifically, forward references), it is possible to create situations where several empty iterations are required, sometimes even in the middle of a loop, as in the example below.irb(main):001:0> / (a|\2b|\3()|())* /x.match("aaabbb") => #<MatchData "aaabbb" 1:"" 2:"" 3:"">InNFATraversalRegexASTVisitor, we let NFA transitions pass through one empty iteration of a loop (extraEmptyLoopIterationsinNFATraversalRegexASTVisitor#doAdvance). This generates an extra empty iteration at the end of loops and it also gives correct behavior on constructions such as the one given above, as it lets us generate transitions that use an extra empty iteration though the loop to populate some new capture group and then arrive at a new backreference node. Since a single NFA transition can now correspond to more complex paths through the AST, we also need to change the way we check the guards that the transitions are annotated with by interleaving the state changes and assertions (see the use ofTRegexBacktrackingNFAExecutorNode#transitionMatchesStepByStep). We also need to implement the empty check, by verifying the state of the capture groups on top of verifying the current index (seeTRegexBacktrackingNFAExecutorNode#monitorCaptureGroupsInEmptyCheck). For that, we need fine-grained information about capture group updates and so we include this information in the transition guards byTransitionGuard.createUpdateCG(int). In unrolled loops, we disable empty checks altogether (inJSRegexParser, in the calls toRegexParser#createOptional). This is correct since Ruby's empty checks terminate a loop only when it reaches a fixed point w.r.t. to any observable state. Finally, also inJSRegexParser. we also disable an optimization that drops zero-width groups and lookaround assertions with optional quantifiers. - failing the empty check should lead to matching the sequel of the quantified expression
instead of backtracking: In ECMAScript, when a loop fails the empty check (an iteration matches
only the empty string), the engine terminates the loop by rejecting this branch and backtracking
to another alternative (eventually backtracking to the point where it chooses not to re-enter the
loop and consider it finished). On the other hand, in Ruby, when a loop fails the empty check (an
iteration matches only the empty string and it does not modify the state of the capture groups),
the engine continues with the current branch by proceeding to the continuation of the loop. Most
notably, it doesn't try to backtrack and alter decisions made inside the loop until some future
failure forces it to. This can be illustrated on the following example, where ECMAScript will
backtrack into the loop and choose the second alternative, whereas Ruby will proceed with the
empty match.
Node.js (ECMAScript) > /(?:|a)?/.exec('a') [ 'a', index: 0, input: 'a', groups: undefined ] MRI (Ruby): irb(main):001:0> /(?:|a)?/.match('a') => #<MatchData "">We implement this inNFATraversalRegexASTVisitorby introducing two transitions whenever we leave a loop, one leading to the start of the loop (empty check passes) and one escaping past the loop (empty check fails). The two transitions are then annotated with complementary guards (TransitionGuard.createEscapeZeroWidth(com.oracle.truffle.regex.tregex.parser.Token.Quantifier)andTransitionGuard.createEscapeZeroWidth(com.oracle.truffle.regex.tregex.parser.Token.Quantifier), respectively), so that at runtime, only one of the two transitions will be admissible.
-
Nested Class Summary
Nested classes/interfaces inherited from class com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavor
RegexFlavor.EqualsIgnoreCasePredicate -
Field Summary
FieldsFields inherited from class com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavor
BACKREFERENCE_IGNORE_CASE_MULTI_CHAR_EXPANSION, BACKREFERENCES_TO_UNMATCHED_GROUPS_FAIL, EMPTY_CHECKS_MONITOR_CAPTURE_GROUPS, EMPTY_CHECKS_ON_MANDATORY_LOOP_ITERATIONS, FAILING_EMPTY_CHECKS_DONT_BACKTRACK, HAS_CONDITIONAL_BACKREFERENCES, LOOKBEHINDS_RUN_LEFT_TO_RIGHT, NEEDS_GROUP_START_POSITIONS, NESTED_CAPTURE_GROUPS_KEPT_ON_LOOP_REENTRY, SUPPORTS_RECURSIVE_BACKREFERENCES, USES_LAST_GROUP_RESULT_FIELD -
Method Summary
Modifier and TypeMethodDescriptioncreateParser(RegexLanguage language, RegexSource source, CompilationBuffer compilationBuffer) createValidator(RegexLanguage language, RegexSource source, CompilationBuffer compilationBuffer) Methods inherited from class com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavor
backreferenceIgnoreCaseMultiCharExpansion, backreferencesToUnmatchedGroupsFail, canHaveEmptyLoopIterations, emptyChecksMonitorCaptureGroups, emptyChecksOnMandatoryLoopIterations, failingEmptyChecksDontBacktrack, hasConditionalBackReferences, lookBehindsRunLeftToRight, matchesTransitionsStepByStep, needsGroupStartPositions, nestedCaptureGroupsKeptOnLoopReentry, supportsRecursiveBackreferences, usesLastGroupResultField
-
Field Details
-
INSTANCE
-
UNICODE
-
-
Method Details
-
createValidator
public RegexValidator createValidator(RegexLanguage language, RegexSource source, CompilationBuffer compilationBuffer) - Specified by:
createValidatorin classRegexFlavor
-
createParser
public RegexParser createParser(RegexLanguage language, RegexSource source, CompilationBuffer compilationBuffer) - Specified by:
createParserin classRegexFlavor
-
getEqualsIgnoreCasePredicate
- Specified by:
getEqualsIgnoreCasePredicatein classRegexFlavor
-
getCaseFoldAlgorithm
- Specified by:
getCaseFoldAlgorithmin classRegexFlavor
-