I know that NSRegularExpression works on Unicode code points and (normal) JavaScript regex works on UTF-16 code units, but I don't know what I should change in my regex.

Regex: <text[^>]+>([^<]+)<\/text>

Works here: regex101

My parsing method:

func parseCaptions(text: String) -> String? {
    let textRange = NSRange(location: 0, length: text.count)
    let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")
    let matches = regex.matches(in: text, range: textRange)

    var result: String?

    for match in matches {
        let range = match.range

        let first = text.index(text.startIndex, offsetBy: range.location)
        let last = text.index(text.startIndex, offsetBy: range.location + range.length)

        var string = String(text[first...last])

        string = string.replacingOccurrences(of: "\n", with: " ")
        string = string.replacingOccurrences(of: "&amp;#39;", with: "'")
        string = string.replacingOccurrences(of: "&amp;quot;", with: "\"")
        string.append("\n")

        result = string
    }

    return result
}
Dewerro

1 Answer

The regex isn't the issue; it's what you do with the matches.

You do:

var result: String?

for match in matches {
    let range = match.range
    let first = text.index(text.startIndex, offsetBy: range.location)
    let last = text.index(text.startIndex, offsetBy: range.location + range.length)

    var string = String(text[first...last])
    ...
    result = string
}
return result

So on every iteration you overwrite result, and only the last match is returned.

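A solution that appends every match instead of overwriting, and reads capture group 1 (the caption text) rather than the whole match: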

func parseCaptions(text: String) -> String {
    // NSRange is based on NSString, which counts UTF-16 code units, while Swift's `String.count` counts Characters (grapheme clusters), so `text.count` might be wrong
    let textRange = NSRange(location: 0, length: text.utf16.count)
    let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")
    let matches = regex.matches(in: text, range: textRange)

    var result: String = ""
    for match in matches {
        let textNSRange = match.range(at: 1)
        let textRange = Range(textNSRange, in: text)!
        var string = String(text[textRange])
        string = string.replacingOccurrences(of: "\n", with: " ")
        string = string.replacingOccurrences(of: "&#39;", with: "'")
        string = string.replacingOccurrences(of: "&amp;quot;", with: "\"")
        string.append("\n")
        result.append(string)
    }
    return result
}
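
To make the counting issue concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the pattern comes from the question. It assumes a caption containing an emoji, where the Character count and the UTF-16 count differ, so a range built from `count` is too short and the closing tag falls outside the search range.

```swift
import Foundation

// Hypothetical caption line; the emoji is 1 Character but 4 UTF-16 code units.
let sample = "<text start=\"0\" dur=\"1\">Hi 👋🏽</text>"

let characterCount = sample.count       // counts Characters (grapheme clusters)
let utf16Count = sample.utf16.count     // counts UTF-16 code units, as NSString does

// Two ways to build the search range: one too short, one covering the whole string.
let shortRange = NSRange(location: 0, length: characterCount)
let fullRange = NSRange(sample.startIndex..., in: sample)

let regex = try! NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>")

// The short range cuts off the end of "</text>", so nothing matches.
let missed = regex.matches(in: sample, range: shortRange)
// The full range sees the whole string.
let found = regex.matches(in: sample, range: fullRange)

// Convert the NSRange of capture group 1 back into a Swift range without force-unwrapping.
let captured = found.first
    .flatMap { Range($0.range(at: 1), in: sample) }
    .map { String(sample[$0]) }

print(characterCount, utf16Count, missed.count, found.count, captured ?? "nil")
```

Building the range with `NSRange(text.startIndex..., in: text)` is equivalent to using `text.utf16.count` as the length, and avoids thinking about the units at all.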

So, with input:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<transcript>
<text start="9.462" dur="1.123">Aaaah</text>
<text start="70.507" dur="5.51">So guys, apparently we control Rewind this year.</text>
<text start="76.017" dur="4.842">
Y&#39;all we can do whatever we want. What do we do?
</text>
</transcript>

We get:

Aaaah
So guys, apparently we control Rewind this year.
 Y'all we can do whatever we want. What do we do? 
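
Note the leading and trailing space on the last line: they come from replacing the newlines inside the tag with spaces. If that is unwanted, one possible variant is sketched below. `parseCaptionLines` is a hypothetical name, not part of the code above; it assumes you would rather get an array of trimmed lines, it skips a match instead of crashing when a range cannot be converted, and it only decodes the one entity that appears in the sample input.

```swift
import Foundation

// Hypothetical variant of parseCaptions: returns trimmed lines and avoids force-unwraps.
func parseCaptionLines(text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: "<text[^>]+>([^<]+)<\\/text>") else {
        return []
    }
    // Range expressed in UTF-16 code units, covering the whole string.
    let fullRange = NSRange(text.startIndex..., in: text)

    return regex.matches(in: text, range: fullRange).compactMap { match -> String? in
        // Capture group 1 is the text between the tags.
        guard let range = Range(match.range(at: 1), in: text) else { return nil }
        return String(text[range])
            .replacingOccurrences(of: "\n", with: " ")
            .replacingOccurrences(of: "&#39;", with: "'")
            .trimmingCharacters(in: .whitespacesAndNewlines)
    }
}

let xml = """
<transcript>
<text start="9.462" dur="1.123">Aaaah</text>
<text start="76.017" dur="4.842">
Y&#39;all we can do whatever we want. What do we do?
</text>
</transcript>
"""

let lines = parseCaptionLines(text: xml)
print(lines.joined(separator: "\n"))
```

Returning an array leaves the choice of separator to the caller; `joined(separator: "\n")` gives one caption per line.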
Larme