I am trying to read a file using spark.sparkContext.textFile. The file is Unicode-encoded; when I read it, some of the characters come out like this:
2851 K�RNYE HUNGARY
2851 K�RNYE HUNGARY
How do I read a file into an RDD while specifying the encoding?
Using SparkContext.binaryFiles() should help. You just need to build each String from the raw bytes, specifying the relevant charset. The example below uses ISO-8859-1:
import java.nio.charset.StandardCharsets
import spark.implicits._ // required for .toDF

// binaryFiles returns (path, PortableDataStream) pairs; decode each stream's bytes explicitly
val df = spark.sparkContext.binaryFiles(filePath, 12)
  .mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
  .toDF
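To see why the wrong charset produces the � characters from the question, here is a minimal, Spark-free sketch of the decoding step. The byte values are an assumption chosen for illustration: "KÖRNYE" encoded as ISO-8859-1, where Ö is the single byte 0xD6.

```scala
import java.nio.charset.StandardCharsets

object DecodeDemo {
  def main(args: Array[String]): Unit = {
    // "KÖRNYE" as ISO-8859-1 bytes; Ö is 0xD6 in that charset (illustrative example)
    val bytes: Array[Byte] =
      Array('K'.toByte, 0xD6.toByte, 'R'.toByte, 'N'.toByte, 'Y'.toByte, 'E'.toByte)

    // Decoding with the wrong charset: 0xD6 is an invalid UTF-8 sequence here,
    // so the decoder substitutes the replacement character U+FFFD (�)
    val wrong = new String(bytes, StandardCharsets.UTF_8)

    // Decoding with the charset the file was actually written in recovers the text
    val right = new String(bytes, StandardCharsets.ISO_8859_1)

    println(wrong) // K�RNYE
    println(right) // KÖRNYE
  }
}
```

This is exactly what happens inside textFile, which always decodes as UTF-8; binaryFiles defers the decoding to you so you can pick the correct charset.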