I am trying to read a file using spark.sparkContext.textFile. The file is Unicode-encoded; when I read it, some of the characters come out like this:
2851 K�RNYE HUNGARY
2851 K�RNYE HUNGARY
How do I read a file into an RDD while specifying the encoding?
Using SparkContext.binaryFiles() should help. You just need to build each String from the raw bytes, specifying the relevant charset. The example below uses ISO-8859-1:
import java.nio.charset.StandardCharsets
import spark.implicits._ // required for .toDF

// binaryFiles returns (path, PortableDataStream) pairs; decode each stream's bytes explicitly
val df = spark.sparkContext.binaryFiles(filePath, 12)
  .mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
  .toDF
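To see why the wrong charset produces the � characters from the question, here is a minimal, Spark-free sketch of the decoding step. The byte values are an assumption chosen for illustration: "KÖRNYE" encoded as ISO-8859-1, where Ö is the single byte 0xD6.

```scala
import java.nio.charset.StandardCharsets

object DecodeDemo {
  def main(args: Array[String]): Unit = {
    // "KÖRNYE" as ISO-8859-1 bytes; Ö is 0xD6 in that charset (illustrative example)
    val bytes: Array[Byte] =
      Array('K'.toByte, 0xD6.toByte, 'R'.toByte, 'N'.toByte, 'Y'.toByte, 'E'.toByte)

    // Decoding with the wrong charset: 0xD6 is an invalid UTF-8 sequence here,
    // so the decoder substitutes the replacement character U+FFFD (�)
    val wrong = new String(bytes, StandardCharsets.UTF_8)

    // Decoding with the charset the file was actually written in recovers the text
    val right = new String(bytes, StandardCharsets.ISO_8859_1)

    println(wrong) // K�RNYE
    println(right) // KÖRNYE
  }
}
```

This is exactly what happens inside textFile, which always decodes as UTF-8; binaryFiles defers the decoding to you so you can pick the correct charset.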