
I'm trying to use a regex to capture tweets containing the substring at least twice, so I'm using an unsophisticated ^.+ .+ .+$. However, this doesn't match strings that instead contain, for example, .

Is there a smart way to capture an emoji with any skin-tone variation or none, without just listing each variant in a character class (like [])?

Lucas Trzesniewski
Cai
  • Are they Unicode Characters? – Kaspar Lee Mar 31 '16 at 11:05
  • How are these emoji represented? Unicode? If yes, what is their value? – dambros Mar 31 '16 at 11:08
  • Ok, this question *definitely* requires the regex flavor to be known - what language/regex lib are you using? – Lucas Trzesniewski Mar 31 '16 at 11:19
  • @Druzion Ah, yes thank you, this was a prompt I needed to probe a little further into how emojis are represented on twitter. I've now figured an answer so can share it below. – Cai Mar 31 '16 at 11:23
  • @LucasTrzesniewski, yikes, this is a question I don't fully understand, sorry. However, I've now found what I was looking for, so I'll include it in the answer below. – Cai Mar 31 '16 at 11:24
  • @Cai what I meant is: the answer will be very different, depending on the regex engine that you'll use. PCRE/.NET/Python/Java/JavaScript/etc... I could just tell you to use `(?=)\X` but that wouldn't work in several of these. – Lucas Trzesniewski Mar 31 '16 at 11:26
  • @LucasTrzesniewski Ah, ok, got it. Thanks. As it happened I was just using Textmate, which claims to use the Oniguruma regex library, which likely explains why your suggestion didn't work for me :) – Cai Mar 31 '16 at 11:41

1 Answer


Thanks to the comments above, I've found that the emojis I've encountered on Twitter are Unicode, and skin-tone variations are produced by emoji modifier characters in the range U+1F3FB to U+1F3FF, which follow the base emoji.

http://unicode.org/reports/tr51/#Emoji_Modifiers_Table

So what I wanted was the base emoji followed by [\x{1f3fb}-\x{1f3ff}]?. That optional character class is something I can drop right after any unmodified emoji to include its skin-tone variations.
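For anyone outside Textmate, here is a minimal sketch of the same idea in Python. The base emoji (thumbs up, U+1F44D) and the helper name `contains_at_least_twice` are assumptions chosen for illustration, not the ones from my original tweets. Python's `re` module does not accept the `\x{...}` syntax, so the range is written with `\U000xxxxx` string escapes instead.

```python
import re

# Assumed base emoji for illustration only: thumbs up, U+1F44D.
BASE = "\U0001F44D"

# Optional skin-tone modifier, U+1F3FB through U+1F3FF.
SKIN_TONE = "[\U0001F3FB-\U0001F3FF]?"

# Base emoji followed by zero or one skin-tone modifier.
EMOJI = re.compile(re.escape(BASE) + SKIN_TONE)


def contains_at_least_twice(text: str) -> bool:
    """Return True if the emoji, with any skin tone or none, occurs at least twice."""
    return len(EMOJI.findall(text)) >= 2


plain = f"nice {BASE} work {BASE}"                      # two unmodified emoji
mixed = f"nice {BASE}\U0001F3FD work {BASE}\U0001F3FF"  # two different skin tones
once = f"only one {BASE}\U0001F3FB here"                # a single modified emoji

print(contains_at_least_twice(plain))
print(contains_at_least_twice(mixed))
print(contains_at_least_twice(once))
```

Counting matches rather than wrapping the emoji in `^.+ ... .+$` also catches tweets where the emoji sits at the very start or end of the text.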

Cai