
I am trying to find the range of specific substrings of a string. Each substring begins with a hashtag and can have any character it likes within it (including emojis). Duplicate hashtags should be detected at distinct ranges. A kind user from here suggested this code:

var str = "The range of #hashtag should be different to this #hashtag"
let regex = try NSRegularExpression(pattern: "(#[A-Za-z0-9]*)", options: [])
let matches = regex.matchesInString(str, options:[], range:NSMakeRange(0, str.characters.count))
for match in matches {
    print("match = \(match.range)")
}

However, this code does not work for emojis. What regex would include emojis? Is there a way to detect a #, followed by any character up until a space or line break?

Tometoyou

1 Answer


As in "Swift extract regex matches", you have to pass an NSRange to the match functions, and the returned ranges are NSRanges as well. One way to handle this is to convert the given text to an NSString, whose length and offsets are counted in the same UTF-16 units that NSRange uses.

The #\S+ pattern matches a # followed by one or more non-whitespace characters.

let text = "The 😀range of #hashtag🐶 should 👺 be 🇩🇪 different to this #hashtag🐮"

let nsText = text as NSString
let regex = try NSRegularExpression(pattern: "#\\S+", options: [])
for match in regex.matchesInString(text, options: [], range: NSRange(location: 0, length: nsText.length)) {
    print(match.range)
    print(nsText.substringWithRange(match.range))
}

Output:

(15,10)
#hashtag🐶
(62,10)
#hashtag🐮

You can also convert between NSRange and Range<String.Index> using the methods from "NSRange to Range<String.Index>".
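As a rough sketch of that conversion, the snippet below maps each NSRange back to a Range<String.Index> and then to character offsets, so that an emoji counts as one character. It is written against current Swift and Foundation names (matches(in:options:range:), Range(_:in:), NSRange(_:in:)) rather than the Swift 2 spellings used above. The helper hashtagRanges, the HashtagMatch type and the sample string are made up for illustration and are not part of the linked answer.

```swift
import Foundation

// Hypothetical result type: the UTF-16 based NSRange plus character-based offsets.
struct HashtagMatch {
    let nsRange: NSRange   // location/length in UTF-16 code units
    let offset: Int        // start offset in Characters
    let count: Int         // length in Characters
    let tag: String
}

// Hypothetical helper, assuming the "#\\S+" pattern from the answer.
func hashtagRanges(in text: String) -> [HashtagMatch] {
    guard let regex = try? NSRegularExpression(pattern: "#\\S+", options: []) else {
        return []
    }
    // NSRange covering the whole string, measured in UTF-16 code units.
    let whole = NSRange(text.startIndex..., in: text)
    var result: [HashtagMatch] = []
    for match in regex.matches(in: text, options: [], range: whole) {
        // Convert the NSRange back into a Swift String range.
        guard let range = Range(match.range, in: text) else { continue }
        let offset = text.distance(from: text.startIndex, to: range.lowerBound)
        let count = text.distance(from: range.lowerBound, to: range.upperBound)
        result.append(HashtagMatch(nsRange: match.range, offset: offset,
                                   count: count, tag: String(text[range])))
    }
    return result
}

let found = hashtagRanges(in: "Hi 😀 #tag🐶 and #tag🐮")
for item in found {
    print(item.tag, item.nsRange.location, item.nsRange.length, item.offset, item.count)
}
```

For this sample string the first hashtag has the NSRange location 6 and length 6, but the character offset 5 and character count 5, because each emoji occupies two UTF-16 units and only one Character.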

Remark: As @WiktorStribiżew correctly noticed, the above pattern will include trailing punctuation (commas, periods, etc). If that is not desired then

let regex = try NSRegularExpression(pattern: "#[^[:punct:][:space:]]+", options: [])

would be an alternative.
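To make the difference between the two patterns concrete, here is a small comparison, again as a sketch in current Swift syntax. The helper matchedStrings and the sample sentence are assumptions made up for this example.

```swift
import Foundation

// Hypothetical helper: returns the substrings matched by `pattern` in `text`.
func matchedStrings(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return []
    }
    let nsText = text as NSString
    let whole = NSRange(location: 0, length: nsText.length)
    return regex.matches(in: text, options: [], range: whole).map {
        nsText.substring(with: $0.range)
    }
}

let sample = "Great #swift, and #regex! Also #tagg#gshd"

// "#\S+" runs up to the next whitespace, so punctuation and a second "#" are included.
let greedy = matchedStrings(of: "#\\S+", in: sample)

// The stricter pattern stops at punctuation; "#" itself is punctuation,
// so adjacent hashtags are split into separate matches.
let strict = matchedStrings(of: "#[^[:punct:][:space:]]+", in: sample)

print(greedy)
print(strict)
```

With this input the first pattern yields "#swift,", "#regex!" and "#tagg#gshd", while the second yields "#swift", "#regex", "#tagg" and "#gshd".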

Martin R
  • Note that `#\\S+` will also match punctuation at the end of the hashtag. – Wiktor Stribiżew Sep 26 '16 at 11:28
  • @WiktorStribiżew: You are right, but the question was about how *"to detect a #, followed by any character up until a space/line break"*, and OP stated that hashtags can contain *arbitrary* characters. – Martin R Sep 26 '16 at 11:29
  • Actually it might be a good idea just to have it as a hashtag followed by only letters, numbers, and emojis ... For bonus points could you give the regex that would be for that? – Tometoyou Sep 26 '16 at 11:43
  • @Tometoyou: If you search for "regex emojis" then you should find some approaches. I do not have an immediate solution, and the set of Emoji characters grows with each Unicode release. – I have updated the answer with a simpler solution which excludes punctuation. – Martin R Sep 26 '16 at 11:46
  • Ok thanks! One more thing, I've just noticed that the range for your answer should be (14, 9) and (56, 9). Is there any way to make it count emojis as 1 character? – Tometoyou Sep 26 '16 at 11:50
  • @Tometoyou: Emojis count as two characters in an `NSString` and that is what `NSRange` refers to. You can convert the NSRange back to a Swift `String` range using the `rangeFromNSRange` method at http://stackoverflow.com/a/30404532/1187415 that I linked to. – Martin R Sep 26 '16 at 11:52
  • The regex fails when multiple #tags appear together. The text `Some text #tagg#gshd some more text` matches only 1 tag, `#tagg#gshd`. @MartinR, is there a workaround to avoid this? Thanks – AamirR Jun 07 '17 at 16:23
  • @Aamir: Did you try the other pattern `"#[^[:punct:][:space:]]+"` ? – Martin R Jun 07 '17 at 18:19
  • Oh just noticed, will try in an hour and let you know, thanks – AamirR Jun 07 '17 at 18:21
  • @MartinR The other pattern works perfectly, thanks for pointing out – AamirR Jun 07 '17 at 20:11