I've been trying to find a good way to keep only emojis and letters in a given text, but none of the articles I found worked for me. I've tried to use regex, but I can't make it work. I've also tried emoji4j, but that library seems to work with emojis in the form ":)", which doesn't help me, because my emojis are groups of Unicode characters.

The result I want is the following:

"This is. a text ‍‍‍,,1234" => "This is a text ‍‍‍"
"‍‍‍" => "‍‍‍"
"‍‍‍123abc‍‍‍" => "‍‍‍abc‍‍‍"

Here's the emoji regex: (?:[\u2700-\u27bf]|(?:[\ud83c\udde6-\ud83c\uddff]){2}|[\ud800\udc00-\uDBFF\uDFFF]|[\u2600-\u26FF])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|[\ud83c\udffb-\ud83c\udfff])?(?:\u200d(?:[^\ud800-\udfff]|(?:[\ud83c\udde6-\ud83c\uddff]){2}|[\ud800\udc00-\uDBFF\uDFFF]|[\u2600-\u26FF])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|[\ud83c\udffb-\ud83c\udfff])?)*|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|[\ud83c\udd70-\ud83c\udd71]|[\ud83c\udd7e-\ud83c\udd7f]|\ud83c\udd8e|[\ud83c\udd91-\ud83c\udd9a]|[\ud83c\udde6-\ud83c\uddff]|[\ud83c\ude01-\ud83c\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ud83c\ude3a]|[\ud83c\ude50-\ud83c\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff] .

If I try something like:

val regex = "the_whole_regex_above | [^a-zA-Z]".toRegex()
myText.replace(regex, "")

it won't replace anything; basically every character passes through.

Basically I want to achieve pretty much the same thing as in this question, but using Kotlin.

biafas
  • For example: "This is. a text ‍‍‍,,1234". It will return the same text ("This is. a text ‍‍‍,,1234"). – biafas Jun 09 '20 at 09:34
  • I feel that all you need is to remove all punctuation, symbols (other than those used to form emojis) and digits, right? Try `myText.replace("""[\p{N}\p{P}\p{S}&&[^\p{So}]]+""".toRegex(), "")` – Wiktor Stribiżew Jun 09 '20 at 09:44
  • Sorry for being unclear. What I want is for "This is. a text ‍‍‍,,1234" to return "This is a text ‍‍‍". – biafas Jun 09 '20 at 10:14
  • See https://ideone.com/koXAWG – Wiktor Stribiżew Jun 09 '20 at 10:18
  • @WiktorStribiżew your answer is right. Works as expected. – biafas Jun 09 '20 at 10:36
  • `keep only emojis and letters`, yes? I can give you this if wanted. Note that these other links and the answer don't describe emoji, which is a complex regex. Let me know. –  Jun 09 '20 at 19:16

1 Answer

You want to remove all punctuation, symbols (other than those used to form emojis) and digits.

To do that, you may use

myText = myText.replace("""[\p{N}\p{P}\p{S}&&[^\p{So}]]+""".toRegex(), "")

See the online Kotlin demo.

Details

  • [ - start of a character class that matches:
    • \p{N} - any Unicode number (digits included)
    • \p{P} - any Unicode punctuation
    • \p{S} - any Unicode symbol
    • &&[^\p{So}] - EXCEPT the symbols in the Symbol, Other (So) Unicode category, which are mostly the ones used to form emojis
  • ]+ - end of the character class, matched 1 or more times.
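
As a minimal sketch of how the character class above behaves, the snippet below applies it to the sample strings from the question. The helper name `stripNonEmojiSymbols` and the sample family emoji (written with `\u` escapes) are illustrative assumptions, not part of the original demo, and the sketch assumes Kotlin on the JVM, where `&&` inside a class means intersection.

```kotlin
// Hypothetical helper (the name is illustrative): removes numbers, punctuation
// and symbols, except characters in the Symbol, Other (So) category.
fun stripNonEmojiSymbols(text: String): String =
    text.replace("""[\p{N}\p{P}\p{S}&&[^\p{So}]]+""".toRegex(), "")

fun main() {
    // Man + ZWJ + woman + ZWJ + girl, written with escapes so the source stays ASCII.
    val family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67"

    // Punctuation and digits are dropped; letters, spaces and the emoji are kept.
    println(stripNonEmojiSymbols("This is. a text $family,,1234"))
    println(stripNonEmojiSymbols("${family}123abc$family"))
}
```

One caveat with this approach: skin-tone modifiers (U+1F3FB to U+1F3FF) belong to the Sk category rather than So, so this class strips them as well.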
Wiktor Stribiżew
  • there's a problem with this: symbols like ® pass the filter. – biafas Jun 09 '20 at 11:17
  • @biafas Correct, that char is in the [Symbol, Other](https://www.compart.com/en/unicode/category/So) category. There are some chars in that category that are not used to form emojis. You may add the frequent ones as a separate alternative, since anything listed before `&&` is still filtered out by `[^\p{So}]`: `(?:[\p{N}\p{P}\p{S}&&[^\p{So}]]|[¦©®°҂])+`. The precise solution is to use a huge regex encompassing the whole emoji list. – Wiktor Stribiżew Jun 09 '20 at 11:31
  • isn't there a way to use that huge regex I posted in the original question? – biafas Jun 09 '20 at 11:37
  • @biafas Yes, do you just want to use that one to match only those emojis it matches? It does not match all of them. – Wiktor Stribiżew Jun 09 '20 at 11:45
  • well I'd prefer to use a regex that matches all emojis. I thought this is the right one – biafas Jun 09 '20 at 11:47
  • @biafas [This one does](https://stackoverflow.com/a/48148218/3832970), but it requires an update that I can't make for the time being. – Wiktor Stribiżew Jun 09 '20 at 11:47
  • How will this work in conjunction with only letters? That's the biggest problem I'm facing. I don't know how to combine this long regex with allowing only letters. – biafas Jun 09 '20 at 11:54
  • @biafas Whatever pattern with Unicode escape sequences I try to match an emoji with, there is no match. If you know how to make your pattern match an emoji, then all you need is `myText.replace("((?:\\p{L}\\p{M}*+|$emojiPattern)+)|\\S".toRegex(), "$1")` – Wiktor Stribiżew Jun 09 '20 at 13:56
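
The replacement suggested in the last comment can be sketched as below, assuming a working `emojiPattern` is available. The pattern used here is a deliberately small, hypothetical stand-in (two emoji code point ranges plus the zero-width joiner and variation selector-16) and does not cover all emojis; for example, flags are not matched. The names `keepRegex` and `keepLettersAndEmojis` are likewise illustrative.

```kotlin
// Hypothetical, incomplete stand-in for a full emoji pattern: two emoji ranges,
// plus the zero-width joiner (U+200D) and variation selector-16 (U+FE0F).
val emojiPattern = "[\\x{1F300}-\\x{1FAFF}\\x{2600}-\\x{27BF}\\x{200D}\\x{FE0F}]"

// Group 1 captures runs of letters (with combining marks) or emoji and is put
// back by "$1"; any other non-whitespace character is matched by \S and dropped.
val keepRegex = "((?:\\p{L}\\p{M}*+|$emojiPattern)+)|\\S".toRegex()

fun keepLettersAndEmojis(text: String): String = text.replace(keepRegex, "$1")

fun main() {
    // Man + ZWJ + woman + ZWJ + girl, written with escapes so the source stays ASCII.
    val family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67"

    println(keepLettersAndEmojis("This is. a text $family,,1234"))
    println(keepLettersAndEmojis("${family}123abc$family"))
}
```

Because this is a whitelist rather than a category filter, characters such as ® are removed unless the emoji pattern explicitly includes them. Whitespace is left untouched since neither alternative matches it.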