I'm trying to read a gzip-compressed CSV file that I saved with UTF-8 encoding. Reading it with Pandas takes a long time, but I get the desired output:
out_pd = pd.read_csv('../files/example_file_out.csv.gzip', sep='\t', encoding='utf-8', compression='gzip')
I then do almost the same in Spark to read exactly the same file from HDFS:
out_spark = spark.read.format('csv').options(header="true", sep="\t", encoding="UTF-8").load("/Path/to/Folder/example_file_out.csv.gzip")
out_spark.show()
Instead of the expected table, I get this result:
+-----------------------------------------------------------------------------------------------------+
|���_�example_file_out.csv.gzip�Ѳ�Fr$�|�l�A?��̈��L��F��cWZ�F��Ef�^�5C�k�hW���H$��j�xH�}N|
+-----------------------------------------------------------------------------------------------------+
| @�#"<=<^�������...|
| ?��ϟ���Ͽ��O�����...|
| ރ����Y�^�x�o��e>Y...|
+-----------------------------------------------------------------------------------------------------+
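The rows above look like raw compressed bytes rather than decoded text, so one thing worth checking is whether the file is being decompressed at all. Below is a minimal sketch of that check in plain Python, assuming a local copy of the file is available. The helper names `looks_gzipped` and `has_gz_extension` and the throwaway demo file are hypothetical, made up for illustration. The second helper rests on an assumption to verify rather than a confirmed diagnosis: as far as I understand, Spark picks the decompression codec from the file extension, and gzip is registered under `.gz`, so a `.gzip` suffix may be read as uncompressed text.

```python
import gzip
import os
import tempfile

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def has_gz_extension(path):
    """Return True if the name ends in '.gz' (hypothetical helper).

    Assumption: codec lookup by extension only recognises '.gz' for gzip.
    """
    return path.endswith(".gz")


# Throwaway demo file standing in for a local copy of the real one.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "example_file_out.csv.gzip")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    fh.write("col_a\tcol_b\n1\t2\n")

print(looks_gzipped(demo_path))     # content is gzip-compressed
print(has_gz_extension(demo_path))  # but the name does not end in '.gz'
```

If the first check is true and the second is false for the real file, renaming a copy to end in `.gz` and loading that copy in Spark would be a cheap way to test the assumption.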
I really don't know what I'm doing wrong. Thanks in advance for your help!