
I want to use PySpark to efficiently remove emoji (e.g., :-)) from 1 billion records. How can I achieve this using PySpark syntax?

smci
william007
  • Do you mean emoji or emoticons? Those are two different things. – Ranoiaetep Jun 27 '20 at 06:45
  • Also, you should probably create a [mcve](https://stackoverflow.com/help/minimal-reproducible-example); see the references [here](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – anky Jun 27 '20 at 07:41
  • This topic is super interesting, but your question is way too broad and hence off-topic for SO. To make it on-topic, can you add example data and example code? Do you a) have a list of all the emojis you might encounter, b) want a pretrained model that has a decent list, or c) want to learn them (hard, but doable)? (I've been working on this exact task recently, and I can tell you that a) is manual, b) is seriously fallible, and c) is pretty hard.) – smci Jun 28 '20 at 20:57

1 Answer


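Use the `regexp_replace` function from `pyspark.sql.functions`. It runs as a built-in column expression on the JVM, so it avoids the overhead of a Python UDF.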
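A minimal sketch of that approach, under several assumptions: the text lives in a single string column, a small hand-written pattern for ASCII emoticons plus two Unicode ranges for pictographic emoji is good enough, and the names `PATTERN`, `strip_emoji_local`, and `strip_emoji` are hypothetical helpers invented for this example. The ranges are illustrative and not a complete list of emoji. The emoji characters are embedded literally in the pattern string so that the same pattern can be checked locally with Python's `re` before it is handed to Spark.

```python
import re

# Illustrative pattern (an assumption, not an exhaustive emoji list):
#   - ASCII emoticons such as :-) ;) =D :P
#   - two Unicode ranges covering many pictographic emoji and symbols
# The non-raw string turns \U.../\u... escapes into literal characters,
# which both Python's re and Java's regex engine accept in a character class.
PATTERN = "[:;=]-?[)(DPp]|[\U0001F300-\U0001FAFF\u2600-\u27BF]"


def strip_emoji_local(text, pattern=PATTERN):
    """Apply the pattern with Python's re, useful for checking it on samples."""
    return re.sub(pattern, "", text)


def strip_emoji(df, column, pattern=PATTERN):
    """Return df with matches of `pattern` removed from the string column `column`."""
    # Imported lazily so the pattern can be tested without a Spark installation.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.regexp_replace(F.col(column), pattern, ""))
```

With a DataFrame `df` that has a string column named `text` (a hypothetical name), usage would be `cleaned = strip_emoji(df, "text")`. Note that a short emoticon pattern like this can also hit ordinary text (for example `:D` inside a longer token), so it is worth running `strip_emoji_local` over a sample of real records first.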

Hossein Torabi