I want to use PySpark to efficiently remove emoji (e.g., :-) ) from 1 billion records. How can I achieve this using PySpark syntax?
william007
- Do you mean emoji or emoticons? Those are two different things. – Ranoiaetep Jun 27 '20 at 06:45
- Also, you should probably create a [mcve](https://stackoverflow.com/help/minimal-reproducible-example); see the references [here](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – anky Jun 27 '20 at 07:41
- This topic is super interesting, but your question is way too broad and hence off-topic for SO. To make it on-topic, can you add example data and example code? Do you a) have a list of all the emojis you might encounter, b) want a pretrained model that has a decent list, or c) want to learn them (hard, but doable)? (I've been working on this exact task recently, and I can tell you that a) is manual, b) is seriously fallible, and c) is pretty hard.) – smci Jun 28 '20 at 20:57
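
For what it's worth, here is a minimal sketch of option a) from the last comment, assuming a fixed, hand-picked set of Unicode ranges is good enough. It uses Spark's built-in `regexp_replace` rather than a Python UDF, so rows are not shipped to Python workers, which is what matters at a billion records. The ranges in `EMOJI_PATTERN`, the ASCII emoticon pattern, and the helper names `strip_emoji_py` and `strip_emoji_column` are all my own assumptions for illustration, and the list is not exhaustive. The pattern embeds the literal characters so that the same string can be handed to both Python's `re` and Spark's Java regex engine; that should behave the same on both sides, but verify it on a sample of your data first.

```python
import re

# Assumed, NON-exhaustive set of emoji code point ranges (hand-picked for this sketch).
EMOJI_PATTERN = (
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons block, transport, extended-A
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"                     # caution: U+200D is also meaningful in some scripts
)

# Hypothetical pattern for a few ASCII emoticons such as :-) ;) =D.
# The lookarounds avoid touching text glued to letters or digits (e.g. "http://").
EMOTICON_PATTERN = r"(?<![A-Za-z0-9])[:;=]-?[)(DPp](?![A-Za-z0-9])"


def strip_emoji_py(text, emoticons=False):
    """Pure-Python reference implementation, handy for checking the patterns locally."""
    cleaned = re.sub(EMOJI_PATTERN, "", text)
    if emoticons:
        cleaned = re.sub(EMOTICON_PATTERN, "", cleaned)
    return cleaned


def strip_emoji_column(df, column, emoticons=False):
    """Apply the same patterns to a Spark DataFrame column using native expressions."""
    # Imported lazily so the regex helpers above can be tested without Spark installed.
    from pyspark.sql import functions as F

    cleaned = F.regexp_replace(F.col(column), EMOJI_PATTERN, "")
    if emoticons:
        cleaned = F.regexp_replace(cleaned, EMOTICON_PATTERN, "")
    return df.withColumn(column, cleaned)
```

Usage would look like `df = strip_emoji_column(df, "text", emoticons=True)`, where `"text"` is a placeholder column name. Whether emoticons should be removed at all depends on the answer to the first comment, which is why that part is opt-in.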