I have tweets that contain actual emoji characters such as 😂, not emoji codes like U1F602. I found other questions and answers on StackOverflow, but they didn't help me remove these emojis. My dataframe in Scala has the following fields:

  • id (string)
  • tweets (string)
  • labels (string)

Here is a sample tuple:

id               tweets                              labels
2017-En-21193    Big boss is waiting #panic       fear

Expected Result:

id               tweets                              labels
2017-En-21193    Big boss is waiting #panic          fear
Abu Shoeb
  • Why you hating on emojis ? Is it all emojis? – ctwheels Jan 24 '18 at 21:04
  • Possible duplicate of [What is the regex to extract all the emojis from a string?](https://stackoverflow.com/questions/24840667/what-is-the-regex-to-extract-all-the-emojis-from-a-string) – wp78de Jan 24 '18 at 21:09
  • I don't hate emoji. All I want is plain text in the tweets, so I'm removing them. – Abu Shoeb Jan 24 '18 at 21:09

2 Answers

This can be done with a regex in Scala. One way is to match the emojis themselves and remove them. Another way is to strip every character from the tweets except alphanumerics and punctuation.

One Way (just remove all emojis you want)

import org.apache.spark.sql.functions.regexp_replace
// Put the emojis to strip inside the character class; 😂 (U+1F602) is only an example.
val newDf = oldDf.withColumn("tweets", regexp_replace(oldDf("tweets"), """[😂]""", ""))
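
Listing emojis one by one gets tedious, so here is a rough sketch of the same idea using a code point range instead. It assumes the emojis of interest sit in the Unicode Emoticons block (U+1F600 to U+1F64F) and uses plain Scala strings so it can be tried without Spark. The object and helper names are made up for illustration. Since Spark's `regexp_replace` is, as far as I know, backed by Java regex, the same pattern string should also work there.

```scala
object EmoticonRangeSketch {
  // Character class covering the Emoticons block, written with code point escapes.
  val emoticonRange: String = """[\x{1F600}-\x{1F64F}]"""

  // Hypothetical helper: deletes every code point that falls in the range.
  def stripEmoticons(text: String): String = text.replaceAll(emoticonRange, "")

  def main(args: Array[String]): Unit = {
    // Build the emoji from its code point so the source file stays ASCII-only.
    val joy = new String(Character.toChars(0x1F602))
    println(stripEmoticons("Big boss is waiting #panic " + joy).trim)
    // prints "Big boss is waiting #panic"
  }
}
```

Emojis outside that block are left untouched, so the range may need widening for real tweets.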

Another Way (remove everything except Alphanumeric and Punctuations)

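import org.apache.spark.sql.functions.regexp_replace
// Keeps only spaces, apostrophes, ASCII letters, digits and , . ? ! (add # to the class to keep hashtags).
val newDf = oldDf.withColumn("tweets", regexp_replace(oldDf("tweets"), """[^ 'a-zA-Z0-9,.?!]""", ""))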
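
To see what the whitelist keeps and drops, here is a small Spark-free sketch. It assumes `#` is added to the class so hashtags survive. The object and helper names are hypothetical. Note that anything outside the listed ASCII characters is removed, including accented letters.

```scala
object WhitelistSketch {
  // Same whitelist as above, with # added so hashtags are kept.
  val notAllowed: String = """[^ #'a-zA-Z0-9,.?!]"""

  // Hypothetical helper: removes every character that is not on the whitelist.
  def keepPlainText(text: String): String = text.replaceAll(notAllowed, "")

  def main(args: Array[String]): Unit = {
    val joy = new String(Character.toChars(0x1F602))
    println(keepPlainText("Big boss is waiting #panic " + joy).trim)
    // prints "Big boss is waiting #panic"

    // Non-ASCII letters are dropped too: the accented e disappears.
    println(keepPlainText("caf\u00e9 #panic!"))
    // prints "caf #panic!"
  }
}
```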
Abu Shoeb
  • Oh yes, I should have added # to keep my hashtags, thanks – Abu Shoeb Jan 24 '18 at 21:02
  • This answer doesn't provide a means of removing emojis (without clobbering a lot of non-emojis too). Showing how to remove the characters in the Emoticons block (and only those) would be a good start. In Perl, one can use `\p{Block: Emoticons}`. If that's not available, you should be able to use a character range. The block in question is (currently) 1F600..1F64F. ([Unicode blocks](ftp://www.unicode.org/Public/UNIDATA/Blocks.txt)) – ikegami Jan 24 '18 at 21:12
  • Do we have the same thing in Scala? – Abu Shoeb Jan 24 '18 at 21:12
  • In the [Java regex docs](https://docs.oracle.com/javase/tutorial/essential/regex/unicode.html#properties) there's info that is relevant for Scala. Try `\p{block=Emoticons}`. – Rich Dougherty Jan 25 '18 at 02:11

You can use a regular expression with a Unicode block property to filter emojis out of your string.

For example:

"""\P{block=Emoticons}""".r.findAllIn("Big boss is waiting #panic ").mkString.trim
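
An alternative sketch, in case it helps: instead of collecting every non-emoticon character with `\P{...}`, delete the emoticons directly with `\p{...}` and `replaceAll`. This assumes a JVM that knows the `Emoticons` block. The object and helper names are made up. It also shows the limitation that characters outside that block (here U+1F914, which belongs to Supplemental Symbols and Pictographs) are not removed.

```scala
object EmoticonBlockSketch {
  // Lowercase \p matches code points inside the block (uppercase \P matches those outside it).
  val emoticonBlock: String = """\p{block=Emoticons}"""

  // Hypothetical helper: removes code points from the Emoticons block only.
  def stripEmoticonBlock(text: String): String = text.replaceAll(emoticonBlock, "")

  // Builds a one-code-point string so the source file stays ASCII-only.
  def cp(codePoint: Int): String = new String(Character.toChars(codePoint))

  def main(args: Array[String]): Unit = {
    println(stripEmoticonBlock("Big boss is waiting #panic " + cp(0x1F602)).trim)
    // prints "Big boss is waiting #panic"

    // U+1F914 is outside the Emoticons block, so the input comes back unchanged.
    val thinking = "hmm " + cp(0x1F914)
    println(stripEmoticonBlock(thinking) == thinking)
    // prints "true"
  }
}
```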
darrenmc
  • Looks like in Spark 2.4.0 this is not filtering out all Emojis. Is "block=Emoticons" a static definition? Or does it pull from a list that is updated regularly? – Steve Gon Mar 18 '19 at 23:43
  • It is statically defined in the libraries of the JVM version you are running. The `Emoticons` block should be available from 1.7 onwards, with more Unicode blocks supported in 1.8 and 1.9. – darrenmc Mar 20 '19 at 17:45
  • Thanks darrenmc, I had to go to a static pattern typed in with all the \u2xxx codes listed. – Steve Gon Mar 20 '19 at 23:34
  • You can also define a range in your regex. For example, I found that `block=Supplemental Symbols and Pictographs` was not supported by the JVM version we are running, so I used this pattern instead: `"""[\uD83E\uDD00-\uD83E\uDDFF]"""` – darrenmc Mar 21 '19 at 14:28
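
Building on the range idea from the comments, here is a sketch that joins several block ranges into one character class, so it doesn't depend on which block names a given JVM supports. The choice of blocks is only an illustrative assumption and not a complete definition of "emoji". The object and helper names are hypothetical. The last range, `\x{1F900}-\x{1F9FF}`, is the same range as the surrogate pair form `\uD83E\uDD00-\uD83E\uDDFF` quoted above, written with code point escapes.

```scala
object EmojiRangesSketch {
  // Illustrative selection of Unicode block ranges, not an exhaustive emoji list.
  val emojiRanges: String =
    "[" +
      """\x{2600}-\x{27BF}""" +   // Miscellaneous Symbols and Dingbats
      """\x{1F300}-\x{1F5FF}""" + // Miscellaneous Symbols and Pictographs
      """\x{1F600}-\x{1F64F}""" + // Emoticons
      """\x{1F680}-\x{1F6FF}""" + // Transport and Map Symbols
      """\x{1F900}-\x{1F9FF}""" + // Supplemental Symbols and Pictographs
      "]"

  // Hypothetical helper: removes every code point inside any of the ranges.
  def stripEmoji(text: String): String = text.replaceAll(emojiRanges, "")

  // Builds a one-code-point string so the source file stays ASCII-only.
  def cp(codePoint: Int): String = new String(Character.toChars(codePoint))

  def main(args: Array[String]): Unit = {
    val tweet = "Big boss is waiting #panic " + cp(0x1F602) + cp(0x1F914) + cp(0x2764)
    println(stripEmoji(tweet).trim)
    // prints "Big boss is waiting #panic"
  }
}
```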