
I'm trying to read a CSV file that I saved as a UTF-8 encoded file. Reading the file with pandas takes a long time, but I get the desired output.

out_pd = pd.read_csv('../files/example_file_out.csv.gzip', sep='\t', encoding='utf-8', compression='gzip')

Doing almost the same in Spark to read exactly the same file from HDFS:

out_spark = spark.read.format('csv').options(header = "true", sep = "\t", encoding = "UTF-8").load("/Path/to/Folder/example_file_out.csv.gzip" )
out_spark.show()

With this result:

+-----------------------------------------------------------------------------------------------------+
|���_�example_file_out.csv.gzip�Ѳ�Fr$�|�l�A?��̈��L��F��cWZ�F��Ef�^�5C�k�hW���H$��j�xH�}N|
+-----------------------------------------------------------------------------------------------------+
|                                                                                   @�#"<=<^�������...|
|                                                                                  ?��ϟ���Ͽ��O�����...|
|                                                                                  ރ����Y�^�x�o��e>Y...|
+-----------------------------------------------------------------------------------------------------+

I really don't know what I'm doing wrong. Thanks in advance for your help!

user20382
    try changing the file extension to `.gz`? – mck Jan 11 '21 at 20:15
  • Related: https://stackoverflow.com/a/49502965/480982 – Thomas Weller Jan 11 '21 at 20:15
  • ...or wait for the https://issues.apache.org/jira/browse/SPARK-29280 to be implemented. – mazaneicha Jan 12 '21 at 05:22
  • Thank you! That answered my question. As I mentioned below, I read some json.gzip files prior to the csv file and had no problems. Only the show() function takes forever, which is a little bit strange, because the json files are not as big as the csv. The latter works perfectly now. I'm now curious why the show() function takes that long, even though that may not belong in this topic. – user20382 Jan 12 '21 at 08:59

1 Answer


Spark infers the compression format from the file extension. Gzipped files conventionally have the extension `.gz`, so if you rename your file to end in `.gz` instead of `.gzip`, Spark should be able to decompress the CSV file properly.
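A minimal sketch of the fix (the file name and contents here are hypothetical stand-ins for your data): the bytes are already valid gzip either way, so renaming is enough, because only the extension is consulted when Spark picks a decompression codec.

```python
import gzip
import os
import tempfile

# Create a small gzipped TSV with the unrecognized ".gzip" extension,
# standing in for the original example_file_out.csv.gzip.
tmpdir = tempfile.mkdtemp()
bad_name = os.path.join(tmpdir, "example_file_out.csv.gzip")
with gzip.open(bad_name, "wt", encoding="utf-8") as f:
    f.write("col1\tcol2\nfoo\tbar\n")

# Rename .csv.gzip -> .csv.gz; the bytes are untouched, only the
# extension that Spark's codec lookup keys on changes.
good_name = bad_name[: -len("gzip")] + "gz"
os.rename(bad_name, good_name)

# Same content, now under a name Spark will recognize as gzip.
with gzip.open(good_name, "rt", encoding="utf-8") as f:
    header = f.read().splitlines()[0]
print(header)
```

After the rename, the original `spark.read.format('csv').options(header="true", sep="\t", encoding="UTF-8").load(...)` call pointed at the `.gz` path should decompress and parse the file instead of showing raw gzip bytes.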

mck
  • Thank you, I will do that immediately. The only reason why I haven't tried that yet is because I've read other gzipped files without any problems. Or does it only apply to csv files, that it's better to use `.gz` instead of `.gzip`? – user20382 Jan 11 '21 at 20:28
  • I don't think it applies only to csv files - all other files should also have the extension `.gz` for Spark to process them properly. Not sure why the other files worked for you. – mck Jan 11 '21 at 20:29