
I am trying to read a file using spark.sparkContext.textFile. The file is Unicode encoded, and when I read it some of the characters come out like this:

2851 K�RNYE HUNGARY
2851 K�RNYE HUNGARY

How can I read a file into an RDD while specifying the encoding?

1 Answer


Using SparkContext.binaryFiles() should help. It gives you each file's raw bytes, so you can build the string content yourself with the relevant charset.

The example below is for ISO-8859-1:

import java.nio.charset.StandardCharsets
import spark.implicits._ // required for .toDF

val df = spark.sparkContext.binaryFiles(filePath, 12)
  .mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
  .toDF
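
If the file is actually UTF-8, as the comments below suggest, the same pattern applies with a different charset. A minimal sketch, assuming the same filePath:

val dfUtf8 = spark.sparkContext.binaryFiles(filePath, 12)
  .mapValues(content => new String(content.toArray(), StandardCharsets.UTF_8))
  .toDF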


Oladipo
  • Thanks! My source is a multi-delimited file encoded in UTF-8. When I tried the above option I get the output as |_1 |_2 |file:/C:/Senthil/SenStudy/Scala/Files/multidelimiter.txt|input!~outout!~value!~count senthil!~nathan!~is!~here. How do I convert it to normal rows and columns (like a normal dataframe)? – senthilnathan May 07 '19 at 00:36
  • @senthilnathan this is probably what you need https://stackoverflow.com/a/46914895/4528242 – Oladipo May 07 '19 at 12:51
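
Following up on that comment thread, here is a minimal sketch of turning the one-string-per-file result into rows and columns. The path, the !~ delimiter, and the field layout are taken from the comment above; treating the first line as a header is an assumption:

import java.nio.charset.StandardCharsets
import spark.implicits._

// Hypothetical path from the comment; adjust to your environment.
val raw = spark.sparkContext
  .binaryFiles("C:/Senthil/SenStudy/Scala/Files/multidelimiter.txt")
  .mapValues(stream => new String(stream.toArray(), StandardCharsets.UTF_8))
  .values                          // drop the file path, keep the content
  .flatMap(_.split("\\r?\\n"))     // one record per line
  .map(_.split("!~"))              // split each record on the !~ delimiter

val header = raw.first()           // assumes the first line holds the column names
val df = raw
  .filter(!_.sameElements(header)) // drop the header record
  .map { case Array(in, out, value, count) => (in, out, value, count) } // assumes 4 fields per record
  .toDF(header: _*)

df.show()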