
On my AWS cluster, I have installed a Python package:

  python3 -m pip install Unidecode

Now I want to use it on a PySpark DataFrame column named 'city', which takes accented values like 'são paulo' and 'seropédica', and create a new column named 'city_no_accents' that strips the accents and leaves plain ASCII text like 'sao paulo' and 'seropedica'.

So I wrote the PySpark code below:

# <...imported some other packages>
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from unidecode import unidecode

def remove_accents(data):
    return unidecode(data)

if __name__ == '__main__':
    # create Spark session
    spark = SparkSession.builder.appName("GetData").getOrCreate()
    logging.info("Spark Session initiated")
    sm = sparkManager.sparkManager(spark)  # project-specific helper
    remove_accents_udf = udf(remove_accents)

# apply the UDF to the 'city' column
city_df_without_accents = city_df.withColumn('city_no_accents', remove_accents_udf('city'))

city_df_without_accents.show(5)

The last line in the above code gives me the error below:

File "/usr/lib/spark/python/pyspark/serializers.py", line 580, in loads return pickle.loads(obj, encoding=encoding) ModuleNotFoundError: No module named 'unidecode'

But if, instead of a DataFrame column, I pass a plain string to the underlying function, it works fine. For example:

x = 'são paulo'
remove_accents(x)

OUTPUT: 'sao paulo'

So, is there a way to convert all the rows of a particular DataFrame column (i.e. 'city') into plain ASCII text?

PySpark ==> version 2.4.4

Python ==> version 3.6.8

Bhuvi007
  • Does this answer your question? [What is the best way to remove accents with Apache Spark dataframes in PySpark?](https://stackoverflow.com/questions/38359534/what-is-the-best-way-to-remove-accents-with-apache-spark-dataframes-in-pyspark) – user40929 Oct 15 '20 at 14:36
  • No, again I will be stuck with the same issue so it won't work. I have posted a solution below. – Bhuvi007 Oct 16 '20 at 05:34

1 Answer


I found one solution (it might not be an optimal one, though).

First, convert the PySpark DataFrame into a pandas DataFrame:

import pandas as pd
from pyspark.sql.types import *

# collect the Spark DataFrame onto the driver as a pandas DataFrame
city_geolocation_mappings_results_df_pd = city_geolocation_mappings_results_df.toPandas()

Then make use of this question: How to replace accents in a column of a pandas dataframe

Then convert the pandas DataFrame back to a PySpark DataFrame.
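A minimal sketch of the full roundtrip, assuming the data fits in driver memory (the column and variable names follow the question; schema inference by createDataFrame is an assumption):

from unidecode import unidecode

# collect to the driver, where unidecode is installed
pdf = city_df.toPandas()

# strip accents from the 'city' column (assumes no null values)
pdf['city_no_accents'] = pdf['city'].apply(unidecode)

# convert back to a Spark DataFrame
city_df_without_accents = spark.createDataFrame(pdf)
city_df_without_accents.show(5)

Note that everything here runs on the driver, so it sidesteps the executor import problem but will not scale to data that cannot be collected onto a single machine.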

Bhuvi007