I have a spark dataframe with location column, which contains names of countries. I need to convert these names to ISO3 code. I know that there is a python library country-converter, but I don't know how to apply it so it converts values of only one column in my dataframe.
Example dataframe:
UserName | Location |
---|---|
adam | United States |
anastasia | United Kingdom |
I am able to convert country names from the column to codes by transforming them to RDD and after converting again to DF:
import country_converter as coco
out_format = "ISO3"
countries = df.select("Location").rdd.collect()
countries = coco.convert(names=countries, to=out_format, not_found=None)
countriesDF = spark.createDataFrame(countries, StringType())
Output:
value |
---|
USA |
GBR |
However I have two problems with this code:
- As a result I create completely different dataframe and I loose information about UserName. I need output to be like this:
Expected output:
UserName | Location |
---|---|
adam | USA |
anastasia | GBR |
- Some of the results are like this: Row(Countries='London, UK'), how can I get rid of this? I use the below code but I wonder if there is faster way than doing it manually for every Row:
countriesDF.replace({"Row(Countries='London, UK')" : "GBR"})