
I noticed some very strange behavior when trying to match the <> tags in a string:

let s = "TEST \r\n\r\n<strong>more:</strong>"
let re = try! NSRegularExpression(pattern: "<.*?>")
let matches = re.matches(in: s, range: NSRange(location: 0, length: s.count))

This results in only 1 match (there should have been 2: `<strong>` and `</strong>`)

▿ 1 element
  - 0 : <NSSimpleRegularExpressionCheckingResult: 0x600003be3ac0>{9, 8}{<NSRegularExpression: 0x600002019080> <.*?> 0x0}

However, when I remove the \r\n from the input text

let s = "TEST <strong>more:</strong>"

I get the expected 2 matches!

▿ 2 elements
  - 0 : <NSSimpleRegularExpressionCheckingResult: 0x600002e0ea00>{5, 8}{<NSRegularExpression: 0x6000035faaf0> <.*?> 0x0}
  - 1 : <NSSimpleRegularExpressionCheckingResult: 0x600002e0ed40>{18, 9}{<NSRegularExpression: 0x6000035faaf0> <.*?> 0x0}

What is going on?

Peter Lapisu
  • I think your backslash characters need escaping: `"TEST \\r\\n\\r\\nmore:"` – Philip Wrage Apr 27 '23 at 17:11
  • Does this help? https://stackoverflow.com/questions/28917893/swift-regular-expression-format – Philip Wrage Apr 27 '23 at 17:11
  • @Wrangle no, it is not related; also there is no reason for the \r\n to be escaped – Peter Lapisu Apr 27 '23 at 17:30
  • for length use `s.utf16.count` – udi May 02 '23 at 08:08
  • @udi pls expand on that (the caveat of using count vs utf16.count) and post it as an answer, it looks like it works! – Peter Lapisu May 02 '23 at 08:16
  • `NSRegularExpression` is an Objective-C based API and uses `NSString`, which counts length in UTF-16 code units. There are a few questions here on SO about that. I guess that if you added "someExtraText" at the end of your test String, it would have worked as expected. So either use `s.utf16.count` for the length of the `NSRange` or use the `Swift.Range` -> `NSRange` methods. – Larme May 06 '23 at 12:06
  • See related questions: https://stackoverflow.com/questions/46495365/using-nsregularexpression-produces-incorrect-ranges-when-emoji-are-present and plenty of others if you look for "NSRegularExpression Swift utf16 count" – Larme May 06 '23 at 12:08

1 Answer


The problem is that Swift's String treats \r\n (CR LF) as a single Character (one grapheme cluster), while NSString and NSRange count UTF-16 code units:

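import Foundation  // needed for String(format:)

let s2 = "\r\n"
print(s2.count)        // 1 (one Character)
print(s2.utf8.count)   // 2
print(s2.utf16.count)  // 2 (the unit NSRange uses)

print(s2.utf8.map { String(format: "%02x", $0) }.joined())  // "0d0a"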

Your example contains 31 ASCII characters (one UTF-8 byte and one UTF-16 code unit each), but each \r\n pair counts as a single Character, so count comes out 2 lower:

let s = "TEST \r\n\r\n<strong>more:</strong>"
print(s.count)      // 29
print(s.utf8.count) // 31

The NSRange you build uses the Swift Character count (29) as its length, but NSRange lengths are measured in UTF-16 code units (31 here). The search range therefore stops two units short of the end of the string, so the closing > of </strong> is never seen and the second tag cannot match. You can confirm this by appending two or more characters to the end of the string: two matches are then returned.
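As a quick check of that explanation, the sketch below runs the same pattern twice, once with `s.count` and once with `s.utf16.count` as the range length. `matchCount(length:)` is a hypothetical helper introduced only for this illustration.

```swift
import Foundation

let s = "TEST \r\n\r\n<strong>more:</strong>"
let re = try! NSRegularExpression(pattern: "<.*?>")

// Hypothetical helper: number of matches when only the first
// `length` UTF-16 code units of `s` are searched.
func matchCount(length: Int) -> Int {
    re.numberOfMatches(in: s, range: NSRange(location: 0, length: length))
}

print(s.count, matchCount(length: s.count))              // 29 1
print(s.utf16.count, matchCount(length: s.utf16.count))  // 31 2
```

The two lengths agree only when no Character spans more than one UTF-16 code unit, which is why the bug stays hidden for plain ASCII text without \r\n.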

Using s.utf16.count as the length fixes this, as the comments suggest. NSRange also has an initializer that converts a Range<String.Index> into the corresponding NSRange, and with it your example produces two matches:

let s = "TEST \r\n\r\n<strong>more:</strong>"
let re = try! NSRegularExpression(pattern: "<.*?>")
let range = NSRange(s.startIndex..., in: s)
let matches = re.matches(in: s, range: range)
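The results come back as NSRange values too, so the same care is needed when reading the matched text out of the Swift string. A minimal sketch, assuming you want the matched tags as an array of String (the `tags` name is just for illustration):

```swift
import Foundation

let s = "TEST \r\n\r\n<strong>more:</strong>"
let re = try! NSRegularExpression(pattern: "<.*?>")
let matches = re.matches(in: s, range: NSRange(s.startIndex..., in: s))

// Range(_:in:) returns nil when an NSRange does not map onto valid
// String indices, so compactMap drops anything that cannot be converted.
let tags = matches
    .compactMap { Range($0.range, in: s) }
    .map { String(s[$0]) }

print(tags)  // ["<strong>", "</strong>"]
```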

If you can target Swift 5.7 or later (iOS 16 / macOS 13), consider moving to the native Swift Regex API, which works on String indices directly, rather than the older bridged NSString and NSRegularExpression.
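For comparison, here is a rough sketch of the same search with the Swift Regex type, assuming a Swift 5.7+ toolchain and a deployment target that supports it. It builds the regex from the same pattern string at run time; a regex literal would also work where literals are enabled.

```swift
let s = "TEST \r\n\r\n<strong>more:</strong>"

// Same pattern as before, compiled at run time into a Swift Regex.
let regex = try! Regex("<.*?>")

// Each match carries a Range<String.Index>, so no NSRange conversion is needed.
let tags = s.matches(of: regex).map { String(s[$0.range]) }

print(tags)  // ["<strong>", "</strong>"]
```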

Geoff Hackworth