Class RubyFlavor

java.lang.Object
com.oracle.truffle.regex.tregex.parser.flavors.RegexFlavor
com.oracle.truffle.regex.tregex.parser.flavors.RubyFlavor

public final class RubyFlavor extends RegexFlavor
An implementation of the Ruby regex flavor.

This implementation supports all Ruby regular expressions with the exception of the following features:

  • \G escape sequence: In Ruby regular expressions, \G can be used to assert that we are still at the initial index. TRegex only provides limited support for this feature by handling cases when \G appears at the beginning of all top-level alternatives.
  • \K keep command: This command can be used in Ruby regular expressions to modify the matcher's state so that it deletes any characters matched so far and considers the current position as the start of the reported match. There is no operator like this in ECMAScript that would allow one to tinker with the matcher's state.
  • Unicode character properties not supported by ECMAScript and not covered by the POSIX character classes: Ruby regular expressions use the syntax \p{...} for Unicode character properties. Similar to ECMAScript, they offer access to Unicode Scripts, General Categories and some character properties. Ruby also allows access to character properties that refer to POSIX character classes (e.g. \p{Alnum} for [[:alnum:]]). We support all of the above, including any character properties specified by Ruby's documentation. However, Ruby regular expressions still have access to extra Unicode character properties (e.g. Age) that we do not support. We could dive through Ruby's implementation to find out which other properties might be used and try providing them too.
  • recursive \g<...> subexpression calls and \k<...+-x> backreferences to other levels
  • \X extended grapheme cluster escapes: This escape sequence expands into a complex expression that references a lot of Unicode character sets that we currently do not have support for in TRegex.
  • (?~...) absent expressions: These constructs can be used in Ruby regular expressions to match strings that do not contain a match for a given expression. TRegex doesn't have support for this kind of construction.

However, there are subtle differences in how some fundamental constructs behave in ECMAScript regular expressions and Ruby regular expressions. This concerns core concepts like loops and capture groups and their interactions. These issues cannot be handled by transpiling alone and they require extra care on the side of TRegex. The issues and the solutions are listed below.

  • backreferences to unmatched capture groups should fail: In ECMAScript, when a backreference is made to a capture group which hasn't been matched, such a backreference is ignored and matching proceeds. If this happens in Ruby, the backreference will fail to match and the search will stop and backtrack.
     Node.js (ECMAScript):
     > /(?:(a)|(b))\1/.exec("b")
     [ 'b', undefined, 'b', index: 0, input: 'b', groups: undefined ]
    
     MRI (Ruby):
     irb(main):001:0> /(?:(a)|(b))\1/.match("b")
     => nil
     
    This is solved in TRegexBacktrackingNFAExecutorNode, by the introduction of the backrefWithNullTargetSucceeds field, which controls how backreferences to unmatched capture groups are resolved. Also, in JSRegexParser, an optimization that drops forward references and nested references from ECMAScript regular expressions is turned off for Ruby regular expressions.
  • re-entering a loop should not reset enclosed capture groups: In ECMAScript, when a group is re-entered while looping, all of the capture groups contained within the looping group are reset. On the other hand, in Ruby, their contents are preserved from one iteration of the loop to the next. As we see in the example below, ECMAScript drops the contents of the (a) capture group, while Ruby keeps it.
     Node.js (ECMAScript):
     > /((a)|(b))+/.exec("ab")
     [ 'ab', 'b', undefined, 'b', index: 0, input: 'ab', groups: undefined ]
    
     MRI (Ruby):
     irb(main):001:0> /((a)|(b))+/.match("ab")
     => #<MatchData "ab" 1:"b" 2:"a" 3:"b">
     
    This is solved in NFATraversalRegexASTVisitor. The method getGroupBoundaries is modified so that the instructions for clearing enclosed capture groups are omitted from generated NFA transitions when processing Ruby regular expressions.
  • loops should be repeated as long as the state of capture groups evolves: In ECMAScript, when a loop matches the minimum required number of iterations, any further iterations are only matched provided they consume some characters from the input. This is a measure intended to stop infinite loops once they no longer consume any input. Ruby has a similar guard, but it admits extra iterations if they either consume characters or change the state of capture groups. Thus it is possible to have extra iterations that don't consume any characters but that store empty strings as matches of capture groups. In the example below, ECMAScript executes the outer ? loop zero times, since executing it once would consume no characters. As a result, the contents of capture group 1 are null. On the other hand, Ruby executes the loop once, because the execution modifies the contents of capture group 1 so that it contains the empty string.
     Node.js (ECMAScript):
     > /(a*)? /.exec("")
     [ '', undefined, index: 0, input: '', groups: undefined ]
    
     MRI (Ruby):
     irb(main):001:1" /(a*)?/.match("")
     => #<MatchData "" 1:"">
     
    This is solved by permitting one extra empty iteration of a loop when traversing the AST and generating the NFA. In the absence of backreferences, an extra empty iteration is sufficient, because any other iteration on top of that will retread the same path and have no further effects. With backreferences (or more specifically, forward references), it is possible to create situations where several empty iterations are required, sometimes even in the middle of a loop, as in the example below.
     irb(main):001:0> / (a|\2b|\3()|())* /x.match("aaabbb")
     => #<MatchData "aaabbb" 1:"" 2:"" 3:"">
     
    In NFATraversalRegexASTVisitor, we let NFA transitions pass through one empty iteration of a loop (extraEmptyLoopIterations in NFATraversalRegexASTVisitor#doAdvance). This generates an extra empty iteration at the end of loops and it also gives correct behavior on constructions such as the one given above, as it lets us generate transitions that use an extra empty iteration though the loop to populate some new capture group and then arrive at a new backreference node. Since a single NFA transition can now correspond to more complex paths through the AST, we also need to change the way we check the guards that the transitions are annotated with by interleaving the state changes and assertions (see the use of TRegexBacktrackingNFAExecutorNode#transitionMatchesStepByStep). We also need to implement the empty check, by verifying the state of the capture groups on top of verifying the current index (see TRegexBacktrackingNFAExecutorNode#monitorCaptureGroupsInEmptyCheck). For that, we need fine-grained information about capture group updates and so we include this information in the transition guards by TransitionGuard.createUpdateCG(int). In unrolled loops, we disable empty checks altogether (in JSRegexParser, in the calls to RegexParser#createOptional). This is correct since Ruby's empty checks terminate a loop only when it reaches a fixed point w.r.t. to any observable state. Finally, also in JSRegexParser. we also disable an optimization that drops zero-width groups and lookaround assertions with optional quantifiers.
  • failing the empty check should lead to matching the sequel of the quantified expression instead of backtracking: In ECMAScript, when a loop fails the empty check (an iteration matches only the empty string), the engine terminates the loop by rejecting this branch and backtracking to another alternative (eventually backtracking to the point where it chooses not to re-enter the loop and consider it finished). On the other hand, in Ruby, when a loop fails the empty check (an iteration matches only the empty string and it does not modify the state of the capture groups), the engine continues with the current branch by proceeding to the continuation of the loop. Most notably, it doesn't try to backtrack and alter decisions made inside the loop until some future failure forces it to. This can be illustrated on the following example, where ECMAScript will backtrack into the loop and choose the second alternative, whereas Ruby will proceed with the empty match.
     Node.js (ECMAScript)
     > /(?:|a)?/.exec('a')
     [ 'a', index: 0, input: 'a', groups: undefined ]
    
     MRI (Ruby):
     irb(main):001:0> /(?:|a)?/.match('a')
     => #<MatchData "">
     
    We implement this in NFATraversalRegexASTVisitor by introducing two transitions whenever we leave a loop, one leading to the start of the loop (empty check passes) and one escaping past the loop (empty check fails). The two transitions are then annotated with complementary guards (TransitionGuard.createEscapeZeroWidth(com.oracle.truffle.regex.tregex.parser.Token.Quantifier) and TransitionGuard.createEscapeZeroWidth(com.oracle.truffle.regex.tregex.parser.Token.Quantifier), respectively), so that at runtime, only one of the two transitions will be admissible.