On my AWS cluster, I have installed a Python package:
python3 -m pip install Unidecode
Now, I want to use this on a PySpark DataFrame column named 'city', which takes accented values like 'são paulo', 'seropédica' etc. I want to create a new column named 'city_no_accents' that strips all accents from the text and converts it to plain ASCII, like 'sao paulo', 'seropedica' etc.
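For reference, calling unidecode directly on plain Python strings (on the master node, where I ran the pip install) gives exactly the output I want:

from unidecode import unidecode

print(unidecode('são paulo'))   # sao paulo
print(unidecode('seropédica'))  # seropedica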
So, I wrote the PySpark code below:
<...imported some other packages>
import logging

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from unidecode import unidecode

def remove_accents(data):
    # strip accents / transliterate to plain ASCII
    return unidecode(data)

if __name__ == '__main__':
    # create the Spark session
    spark = SparkSession.builder.appName("GetData").getOrCreate()
    sc = spark.sparkContext
    logging.info("Spark session initiated")
    sm = sparkManager.sparkManager(sc)

    # wrap the plain Python function as a Spark UDF
    remove_accents_udf = udf(remove_accents)

    # city_df is loaded elsewhere; keep only the 'city' column
    city_df_with_accents = city_df.select('city')
    city_df_without_accents = city_df_with_accents.withColumn('city_no_accents', remove_accents_udf('city'))
    city_df_without_accents.show(5)
The last line of the code above gives me the error below:
File "/usr/lib/spark/python/pyspark/serializers.py", line 580, in loads return pickle.loads(obj, encoding=encoding) ModuleNotFoundError: No module named 'unidecode'
But if, instead of a DataFrame column, I pass a plain string variable to the underlying Python function, it works fine. For example:
x = 'são paulo'
remove_accents(x)
OUTPUT: 'sao paulo'
So, is there a way I could convert all the rows of a particular DataFrame column (i.e. 'city') into plain ASCII text?
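One direction I have been considering, assuming the root cause is that unidecode exists only on the master node and not on the worker nodes: a UDF built on the standard-library unicodedata module, so the executors need nothing beyond stock Python. A minimal sketch (the DataFrame and column names match my code above; the None guard is my own addition):

import unicodedata
from pyspark.sql.functions import udf

def remove_accents_stdlib(data):
    # pass nulls through untouched
    if data is None:
        return None
    # decompose accented characters (NFKD), then drop the combining marks
    return unicodedata.normalize('NFKD', data).encode('ascii', 'ignore').decode('ascii')

remove_accents_stdlib_udf = udf(remove_accents_stdlib)
city_df.withColumn('city_no_accents', remove_accents_stdlib_udf('city')).show(5)

This handles Latin accents like 'são paulo' -> 'sao paulo', though it is not as thorough as unidecode for non-Latin scripts.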
PySpark ==> version 2.4.4
Python ==> version 3.6.8
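Alternatively, if the only problem is getting the module onto the executors, would shipping a zipped copy of the installed package with the job work? A rough sketch of what I mean (the /tmp path and the zipping step are my assumptions, not something I have running yet):

import os
import shutil
import unidecode

# locate the installed package on the driver and zip it up
pkg_dir = os.path.dirname(unidecode.__file__)
archive = shutil.make_archive('/tmp/unidecode', 'zip', os.path.dirname(pkg_dir), 'unidecode')

# ship the zip to every executor; workers put it on their sys.path
spark.sparkContext.addPyFile(archive)

After this, the import inside the UDF should resolve on the workers as well.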