
I am trying to read a UTF-8 encoded file into Spark with Scala. I am doing this:

val nodes = sparkContext.textFile("nodes.csv")

The given CSV file is in UTF-8, but Spark converts the non-English characters to ?. How do I get it to read the actual values? I tried the same thing in PySpark and it works fine, seemingly because PySpark's textFile() function has an encoding option and reads UTF-8 by default (it seems).

I am sure the file is UTF-8 encoded. I ran this to confirm:

➜  workspace git:(f/playground) ✗ file -I nodes.csv
nodes.csv: text/plain; charset=utf-8
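
One thing that may be relevant: as far as I understand, a non-UTF-8 default charset on the driver JVM can also turn non-English characters into ? when strings are printed, even if the file itself is read correctly. A quick check, nothing Spark-specific:

import java.nio.charset.Charset

// If this prints something other than UTF-8 (e.g. US-ASCII), the ? characters
// may come from the JVM's default charset rather than from Spark's file reading.
println(Charset.defaultCharset())

If that turns out to be the culprit, passing -Dfile.encoding=UTF-8 to the driver JVM (e.g. via spark.driver.extraJavaOptions) should fix the output.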

1 Answer


Following this post, we can read and decode the file ourselves first, then feed the lines to the SparkContext:

import scala.io.{Codec, Source}
import java.nio.charset.CodingErrorAction

// Decode the file as UTF-8 on the driver, ignoring malformed bytes, then
// parallelize the resulting lines into an RDD.
val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
val rdd = sc.parallelize(Source.fromFile(filename)(decoder).getLines().toList)
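
Note that Source.fromFile reads the whole file on the driver, so this only suits files the driver can access and hold in memory. For larger inputs, here is a sketch of a distributed alternative (untested): drop down to the Hadoop API that textFile uses under the hood, and decode each record's bytes with an explicit charset instead of going through Text.toString.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import java.nio.charset.StandardCharsets

// Read raw (byteOffset, line) records and decode the bytes as UTF-8 ourselves.
val lines = sc
  .hadoopFile[LongWritable, Text, TextInputFormat]("nodes.csv")
  .map { case (_, text) =>
    // Text's backing array can be longer than the record, so respect getLength.
    new String(text.getBytes, 0, text.getLength, StandardCharsets.UTF_8)
  }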