
I am trying to read non-ASCII characters from a CSV in PySpark. Specifically, the CSV contains names of countries in Spanish, so I have ESPAÑA (SPAIN in Spanish), but it reads ESPA�OLA.

This is the code I am using:

df = sqlContext.read.csv("path", sep=",", header=True)

I can't find a list of all the encodings that sqlContext.read accepts. I was trying to use latin-1, but I get a message that it is not supported.


1 Answer

Is there a way to convert your file to UTF-8 encoding before loading it with read.csv()?

Other possibly related question: How to parse CSV file with UTF-8 encoding?
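If converting the file is an option, a minimal sketch of the re-encoding step in plain Python (the file names here are hypothetical, and this assumes the source CSV is Latin-1/ISO-8859-1 encoded):

```python
# Create a small sample CSV in Latin-1 to stand in for the original file
# (file names here are hypothetical).
with open("countries.csv", "w", encoding="latin-1") as f:
    f.write("country\nESPAÑA\n")

# Re-encode: read the file as Latin-1 and write it back out as UTF-8,
# which Spark's CSV reader handles by default.
with open("countries.csv", "r", encoding="latin-1") as src, \
     open("countries_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())
```

After this, loading `countries_utf8.csv` with `read.csv()` as in the question should show ESPAÑA correctly.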

  • I saw that post, but what I understand is that there is no option to read these characters with Spark. Is that correct? I want to read them using Spark (if it is possible) – Joe Jan 13 '20 at 19:43
  • Try to run with the parameter encoding="iso-8859" If it still says it's not supported, then your safest bet is to convert your file to UTF-8 first. This might help in that regard: https://stackoverflow.com/questions/191359/how-to-convert-a-file-to-utf-8-in-python – Adhoc Jan 13 '20 at 19:48
  • Yes, that encoding says it is not supported. Thanks – Joe Jan 16 '20 at 13:21