
I made a Mastodon/Twitter <--> IRC bot a while back. It's been working great, but someone complained that when people use emojis on Mastodon (which seems to happen a lot in some usernames), it breaks his terminal.

I was wondering if there is a way to remove those from the ByteStrings before sending them to IRC (or at least to provide an option to do so). Googling a bit, I found this: removing emojis from a string in Python

Looks like \U0001F600-\U0001F64F should be the emoji range if I understand it correctly, but I've never been big on regex. Is there an easy-ish way to translate that to Haskell? I've tried reading up a bit on regex, but I only get "lexical error in string/character literal at character 'U'" when I try, so I assume that syntax must be a Python thing.

Thanks

Ulrar

2 Answers


Unicode characters can be written as escape sequences: a backslash, followed by x for a hexadecimal numeral, o for an octal numeral, or nothing for a decimal numeral giving the character's code point [0]:

putStrLn "\x1f600" -- 

Here, \x is the prefix for the hexadecimal representation of U+1F600, the first code point in the Emoticons block.
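For completeness, here is a small sketch (the grin* names are mine) writing that same code point with each of the three escape forms:

grinHex, grinOct, grinDec :: Char
grinHex = '\x1F600'   -- hexadecimal escape
grinOct = '\o373000'  -- octal escape (0o373000 == 128512)
grinDec = '\128512'   -- decimal escape
-- all three are the same character:
-- grinHex == grinOct && grinOct == grinDec  ==>  True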

You can now remove the emojis using a regex, or you could simply do:

emojis = concat [['\x1F600'..'\x1F64F'],
                 ['\x1F300'..'\x1F5FF'],
                 ['\x1F680'..'\x1F6FF'],
                 ['\x1F1E0'..'\x1F1FF']]
someString = "hello 😀"
removeEmojis = filter (`notElem` emojis)

putStrLn . removeEmojis $ someString -- prints "hello "
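If you process a lot of text, note that notElem scans a list of a few thousand characters for every character of the input; a Data.Set gives logarithmic lookups instead. A minimal sketch of that variant (the emojiSet and removeEmojis' names are mine), using the same ranges:

import qualified Data.Set as Set

emojiSet :: Set.Set Char
emojiSet = Set.fromList $ concat
  [ ['\x1F600'..'\x1F64F']  -- Emoticons
  , ['\x1F300'..'\x1F5FF']  -- Misc Symbols and Pictographs
  , ['\x1F680'..'\x1F6FF']  -- Transport and Map Symbols
  , ['\x1F1E0'..'\x1F1FF']  -- regional indicators (flags)
  ]

removeEmojis' :: String -> String
removeEmojis' = filter (`Set.notMember` emojiSet)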

[0] Haskell 2010 Language Report, Lexical Structure: Character and String Literals

Mahdi Dibaiee
  • Thanks, that does work! I went with the other answer because it's lighter, I think, but both achieve what I was looking for – Ulrar Sep 23 '17 at 16:21

Not an emoji or Unicode expert, but this seems to work:

isEmoji :: Char -> Bool
isEmoji c = let uc = fromEnum c
            in uc >= 0x1F600 && uc <= 0x1F64F

str = "wew"

As Daniel Wagner points out, this can be made even better:

isEmoji :: Char -> Bool
isEmoji c = c >= '\x1F600' && c <= '\x1F64F'

Demo in ghci:

λ> str
"\128513wew\128513"
λ> filter isEmoji str
"\128513\128513"
λ> filter (not . isEmoji) str
"wew"

Explanation: the fromEnum function converts a character to the corresponding Int code point defined by Unicode. The function just checks whether that code point falls within the emoji range to determine whether the character is actually an emoji.
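If you also want to catch the other blocks listed in the first answer, the same idea extends to a list of ranges. A sketch (isEmojiWide is a name I made up):

isEmojiWide :: Char -> Bool
isEmojiWide c = any inRange ranges
  where
    uc = fromEnum c
    inRange (lo, hi) = uc >= lo && uc <= hi
    ranges = [ (0x1F600, 0x1F64F)  -- Emoticons
             , (0x1F300, 0x1F5FF)  -- Misc Symbols and Pictographs
             , (0x1F680, 0x1F6FF)  -- Transport and Map Symbols
             , (0x1F1E0, 0x1F1FF)  -- regional indicators (flags)
             ]

-- usage: filter (not . isEmojiWide) str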

Sibi