
I am trying to read non-ASCII characters from a CSV in PySpark. Specifically, the CSV contains names of countries in Spanish, so I have ESPAÑA (SPAIN in Spanish), but it reads ESPA�OLA.

This is the code I am using:

df = sqlContext.read.csv("path", sep=",", header=True)

I can't find a list of all the encodings that sqlContext.read accepts. I was trying to use latin-1, but I get a message that it is not supported.


1 Answer

Is there a way to convert your file to UTF-8 encoding before loading it with read.csv()?

Other possibly related question: How to parse CSV file with UTF-8 encoding?
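If converting the file is an option, a minimal sketch of the re-encoding step in plain Python (the file names here are hypothetical, and this assumes the source CSV is Latin-1/ISO-8859-1 encoded):

```python
# Create a small sample CSV in Latin-1 to stand in for the original file
# (file names here are hypothetical).
with open("countries.csv", "w", encoding="latin-1") as f:
    f.write("country\nESPAÑA\n")

# Re-encode: read the file as Latin-1 and write it back out as UTF-8,
# which Spark's CSV reader handles by default.
with open("countries.csv", "r", encoding="latin-1") as src, \
     open("countries_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())
```

After this, loading `countries_utf8.csv` with `read.csv()` as in the question should show ESPAÑA correctly.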

  • I saw that post, but what I understand is that there is no option to read these characters with Spark. Is that correct? I want to read them using Spark (if it is possible) – Joe Jan 13 '20 at 19:43
  • Try to run with the parameter encoding="iso-8859" If it still says it's not supported, then your safest bet is to convert your file to UTF-8 first. This might help in that regard: https://stackoverflow.com/questions/191359/how-to-convert-a-file-to-utf-8-in-python – Adhoc Jan 13 '20 at 19:48
  • Yes, that encoding says it is not supported. Thanks – Joe Jan 16 '20 at 13:21