
Let's say I have a file that stores information about people, and one of the lines looks like this:

Sweeper 30 1992-09-22 China/Beijing - 0 2020-07-07 Mary/Linda - Pizza/Lemon

From left to right, it's name, age, date of birth, country of birth, city of birth, number of children, date of marriage (optional), wife's name (optional), ex-wife's name (optional), favourite food, least favourite food.

I want to get all the information from the line using the Swift 5.7 RegexBuilder module. I tried:

let regex = Regex {
    /([a-zA-Z ]+)/ // Name
    " "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Age
    " "
    Capture(.iso8601Date(timeZone: .gmt)) // Date of Birth
    " "
    /([a-zA-Z ]+)/ // Country of Birth
    "/"
    /([a-zA-Z ]+)/ // City of Birth
    " - "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Children Count
    Optionally {
        " "
        Capture(.iso8601Date(timeZone: .gmt)) // Date of Marriage
        Optionally {
            " "
            /([a-zA-Z ]+)/ // Wife
            Optionally {
                "/"
                /([a-zA-Z ]+)/ // Ex-wife
            }
        }
    }
    " - "
    /([a-zA-Z ]+)/ // Favourite food
    "/"
    /([a-zA-Z ]+)/ // Least Favourite Food
}

However, Swift says that it is unable to type check this in reasonable time.

I believe this happens because RegexComponentBuilder (the result builder for regex components) only has buildPartialBlock overloads for up to 10 captures (the "C" type parameters), though I'm not sure of the details:

static func buildPartialBlock<W0, W1, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, R0, R1>(
    accumulated: R0,
    next: R1
) -> Regex<(Substring, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10)>
where R0 : RegexComponent, R1 : RegexComponent,
      R0.RegexOutput == (W0, C1, C2, C3),
      R1.RegexOutput == (W1, C4, C5, C6, C7, C8, C9, C10)

If I make all the Optionally parts required, the error message becomes a bit more apparent.

Ambiguous use of 'buildPartialBlock(accumulated:next:)'

SwiftUI has a similar problem, where the number of views in a view builder cannot exceed 10, in which case you just use a Group to make some of the views a single view. Can you do something similar in RegexBuilder? Make some of the captures a single capture? It seems to have something to do with AnyRegexOutput, but I'm not sure how to use it.

How do I resolve this compiler error?


To avoid an XY problem:

I have a data file where the data is formatted very haphazardly, i.e. it is not machine-readable the way CSV or JSON is. Lines are written in all sorts of formats. Random delimiters are used in random places.

Then another line in the file would have the same information, but formatted in a different way.

What I want to do is to convert this weirdly formatted file into an easy-to-work-with format, like CSV. I've decided to do this with the Swift 5.7 RegexBuilder API. I would find a line in the file, write a regex that matches that line, convert all the lines of the file that match that regex to CSV, then rinse and repeat.

Therefore, I would like to avoid using multiple regexes to parse a single line, as this would mean that I would be writing a lot more regexes.
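To make the workflow above concrete, here is a minimal sketch of one "write a regex, convert the matching lines" pass. The three-field line format and the names `personLine` and `toCSV` are hypothetical stand-ins for illustration, not the real data, and the output does no CSV escaping.

```swift
import RegexBuilder

// Hypothetical, simplified line format: "<name> <age> - <favourite food>"
let personLine = Regex {
    Capture { OneOrMore(.word) }                              // name
    " "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) }   // age
    " - "
    Capture { OneOrMore(.word) }                              // favourite food
}

// Converts every line that matches `personLine` into a CSV row and
// drops the rest, so they can be handled by a later regex.
func toCSV(_ lines: [String]) -> [String] {
    lines.compactMap { line -> String? in
        guard let match = line.wholeMatch(of: personLine) else { return nil }
        let (_, name, age, food) = match.output
        return "\(name),\(age),\(food)"
    }
}

let rows = toCSV(["Sweeper 30 - Pizza", "???"])
print(rows)
```

Lines that do not match are simply skipped, which is what makes the "rinse and repeat" step possible: each new regex only has to describe one line format.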

I'm not sure if a parser like ANTLR4 would solve my problem. Given how randomly the file is formatted, I would need to be changing the parser a lot, causing the files to be generated again and again. I don't think that will be as convenient as using RegexBuilder.

Sweeper
  • Don't know that regex engine but having `(.+?)` 4 times leads to 'Catastrophic backtracking' which ends in timeout. – Poul Bak Sep 22 '22 at 02:36
  • @PoulBak That's Swift 5.7 • Xcode 14. It is native Swift. Check [Swift Evolution Proposal 0350 regex type](https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md) – Leo Dabus Sep 22 '22 at 02:38
  • @LeoDabus: Yeah, but still don't know it. – Poul Bak Sep 22 '22 at 02:39
  • @PoulBak I have no idea how to overcome that limitation. SwiftUI has a similar issue that you can Group the views to solve it. Maybe there is a similar approach here as well. – Leo Dabus Sep 22 '22 at 02:49
  • @PoulBak Does `.+?` cause catastrophic backtracking? I thought it was `.+` that does. I ran a version of the regex with only 10 captures on both matching and non matching strings and it was not too slow. – Sweeper Sep 22 '22 at 02:51
  • Well, when it can't find a match, it will try with a longer match, until it finds a match, it's just that it will match as few as possible instead of as many as possible. – Poul Bak Sep 22 '22 at 02:56
  • @PoulBak In any case, since that's kind of irrelevant to the question, I've edited it out :) – Sweeper Sep 22 '22 at 02:56
  • As I said, I don't know Swift, but in other regexes you can use `Named groups`, which would eliminate that limit. Does that exist in Swift? – Poul Bak Sep 22 '22 at 03:42
  • @PoulBak I don't think they exist in RegexBuilder, but you can use them in regex literals. The limit for the number of groups in a regex literal is also bigger, but I can't use them here because I want to use the localised date and currency parsers, which are also type safe, provided by RegexBuilder. – Sweeper Sep 22 '22 at 03:49
  • I'm unsure whether or not it's actually the limit of 10 you're hitting since Regex uses buildPartialBlock under the hood. I did some experiments with the code you pasted and after tweaking around a bit I get the underlying problem that there's an "ambiguous use of 'buildPartialBlock(accumulated:next:)'". That happens when two potential implementations both match the input. I hope this helps you in your search. I'll look further and post if I find anything. – Theis Egeberg Sep 22 '22 at 05:04

1 Answer


As a hack, you can create a generalised CustomConsumingRegexComponent implementation that takes in

  • any RegexComponent built from a builder, which always has a (Substring, A, B, C ...) tuple as output
  • a transformation that transforms that tuple to a type T that we desire

We can basically create a regex component that takes in some regex and outputs any type T we want, essentially "grouping" the captures.

It's also possible to just not do the transformation, and you'd end up with nested tuples, but I don't like that.

struct Group<RegexOutput, Component: RegexComponent>: CustomConsumingRegexComponent {

    let component: () -> Component
    
    let transform: (Component.RegexOutput) -> RegexOutput
    
    init(@RegexComponentBuilder _ regexBuilder: @escaping () -> Component, transform: @escaping (Component.RegexOutput) -> RegexOutput) {
        component = regexBuilder
        self.transform = transform
    }
    
    func consuming(_ input: String, startingAt index: String.Index, in bounds: Range<String.Index>) throws -> (upperBound: String.Index, output: RegexOutput)? {
        let innerRegex = Regex(component)
        // Match from the current position, but never past the bounds the outer regex is searching in.
        guard let match = input[index..<bounds.upperBound].prefixMatch(of: innerRegex) else { return nil }
        let upperBound = match.range.upperBound
        let output = match.output
        // Collapse the inner (Substring, A, B, C ...) tuple into the single output value.
        let transformedOutput = transform(output)
        return (upperBound, transformedOutput)
    }
}

This is only a hack because the regex inside the Group knows nothing about what comes after the Group, so quantifiers inside the Group won't backtrack to let the rest of the pattern match.

For example, to fix the code in the question, I can put all the marriage-related info into a Group, but I have to add a lookahead inside the Group:

struct Marriage {
    let marriageDate: Date
    let wife: Substring?
    let exWife: Substring?
}

let r = Regex {
    /([a-zA-Z ]+)/ // Name
    " "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Age
    " "
    Capture(.iso8601Date(timeZone: .gmt)) // Date of Birth
    " "
    /([a-zA-Z ]+)/ // Country of Birth
    "/"
    /([a-zA-Z ]+)/ // City of Birth
    " - "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Children Count

    Optionally {
        " "
        Capture(Group {
            Capture(.iso8601Date(timeZone: .gmt)) // Date of Marriage
            Optionally {
                " "
                /([a-zA-Z ]+)/ // Wife
                Optionally {
                    "/"
                    /([a-zA-Z ]+)/ // Ex-wife
                }
            }
            Lookahead(" - ")
        } transform: { (_, date, wife, exWife) in
            Marriage(marriageDate: date, wife: wife, exWife: exWife as? Substring) // unwrap the double optional
        })
    }
    " - "
    /([a-zA-Z ]+)/ // Favourite food
    "/"
    /([a-zA-Z ]+)/ // Least Favourite Food
}

Without the lookahead, this is what happens:

The innermost [a-zA-Z ]+ would match Linda and also the space after it, so the following " - " fails to match. Normally this would trigger backtracking, but since the regex inside the Group doesn't know about anything outside the Group, no backtracking occurs here, and the whole match fails.
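The difference can be reproduced on a much smaller input. The sketch below repeats a compact version of the Group component so that it stands alone, then matches the hypothetical line "Linda - Pizza" with and without the lookahead. The names `withoutLookahead` and `withLookahead` are made up for illustration, and extended `#/.../#` literals are used so that no compiler flag for bare-slash literals is assumed.

```swift
import RegexBuilder

// Compact version of the Group component, repeated so the snippet is self-contained.
struct Group<RegexOutput, Component: RegexComponent>: CustomConsumingRegexComponent {
    let component: () -> Component
    let transform: (Component.RegexOutput) -> RegexOutput

    init(@RegexComponentBuilder _ regexBuilder: @escaping () -> Component,
         transform: @escaping (Component.RegexOutput) -> RegexOutput) {
        component = regexBuilder
        self.transform = transform
    }

    func consuming(_ input: String, startingAt index: String.Index,
                   in bounds: Range<String.Index>) throws
        -> (upperBound: String.Index, output: RegexOutput)? {
        let innerRegex = Regex(component)
        guard let match = input[index..<bounds.upperBound].prefixMatch(of: innerRegex) else { return nil }
        return (match.range.upperBound, transform(match.output))
    }
}

// The greedy character class also consumes the space after "Linda",
// and the Group never gives it back to the outer regex.
let withoutLookahead = Regex {
    Capture(Group {
        #/([a-zA-Z ]+)/#
    } transform: { (_, name) in String(name) })
    " - "
    Capture { OneOrMore(.word) }
}

// The lookahead forces the backtracking to happen inside the Group.
let withLookahead = Regex {
    Capture(Group {
        #/([a-zA-Z ]+)/#
        Lookahead(" - ")
    } transform: { (_, name) in String(name) })
    " - "
    Capture { OneOrMore(.word) }
}

let line = "Linda - Pizza"
let failed = line.wholeMatch(of: withoutLookahead)
let matched = line.wholeMatch(of: withLookahead)
print(failed == nil)
print(matched?.output.1 ?? "no match")
```

If the behaviour described above holds, the first regex should fail to match while the second should capture "Linda" and "Pizza".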

Sweeper