
I'm trying to read a CSV file that I saved as a UTF-8 encoded file. Reading the file with pandas takes a long time, but I get the desired output.

out_pd = pd.read_csv('../files/example_file_out.csv.gzip', sep='\t', encoding='utf-8', compression='gzip')

Doing almost the same in Spark to read exactly the same file from HDFS:

out_spark = spark.read.format('csv').options(header = "true", sep = "\t", encoding = "UTF-8").load("/Path/to/Folder/example_file_out.csv.gzip" )
out_spark.show()

With this result:

+-----------------------------------------------------------------------------------------------------+
|���_�example_file_out.csv.gzip�Ѳ�Fr$�|�l�A?��̈��L��F��cWZ�F��Ef�^�5C�k�hW���H$��j�xH�}N|
+-----------------------------------------------------------------------------------------------------+
|                                                                                   @�#"<=<^�������...|
|                                                                                  ?��ϟ���Ͽ��O�����...|
|                                                                                  ރ����Y�^�x�o��e>Y...|
+-----------------------------------------------------------------------------------------------------+

I really don't know what I'm doing wrong. Thanks in advance for your help!

user20382
    try changing the file extension to `.gz`? – mck Jan 11 '21 at 20:15
  • Related: https://stackoverflow.com/a/49502965/480982 – Thomas Weller Jan 11 '21 at 20:15
  • ...or wait for the https://issues.apache.org/jira/browse/SPARK-29280 to be implemented. – mazaneicha Jan 12 '21 at 05:22
  • Thank you! That answered my question. As I mentioned below, I read some json.gzip files prior to the csv file and had no problems. Only the show() function takes forever, which is a little bit strange, because the json files are not as big as the csv. The latter works perfectly now. I'm now curious why the show() function takes that long, even though that may not belong in this topic. – user20382 Jan 12 '21 at 08:59

1 Answer


Spark infers the compression format from the file extension. Gzipped files conventionally have the extension `.gz`, so if you rename your file to end in `.gz` instead of `.gzip`, Spark should be able to decompress the CSV file properly.
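A minimal sketch of the fix (the file name and contents here are hypothetical stand-ins for your data): the bytes are already valid gzip either way, so renaming is enough, because only the extension is consulted when Spark picks a decompression codec.

```python
import gzip
import os
import tempfile

# Create a small gzipped TSV with the unrecognized ".gzip" extension,
# standing in for the original example_file_out.csv.gzip.
tmpdir = tempfile.mkdtemp()
bad_name = os.path.join(tmpdir, "example_file_out.csv.gzip")
with gzip.open(bad_name, "wt", encoding="utf-8") as f:
    f.write("col1\tcol2\nfoo\tbar\n")

# Rename .csv.gzip -> .csv.gz; the bytes are untouched, only the
# extension that Spark's codec lookup keys on changes.
good_name = bad_name[: -len("gzip")] + "gz"
os.rename(bad_name, good_name)

# Same content, now under a name Spark will recognize as gzip.
with gzip.open(good_name, "rt", encoding="utf-8") as f:
    header = f.read().splitlines()[0]
print(header)
```

After the rename, the original `spark.read.format('csv').options(header="true", sep="\t", encoding="UTF-8").load(...)` call pointed at the `.gz` path should decompress and parse the file instead of showing raw gzip bytes.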

mck
  • Thank you, I will do that immediately. The only reason why I haven't tried that yet is because I've read other gzipped files without any problems. Or does it only apply to csv files, that it's better to use `.gz` instead of `.gzip`? – user20382 Jan 11 '21 at 20:28
  • I don't think it applies only to csv files - all other files should also have the extension `.gz` for Spark to process them properly. Not sure why the other files worked for you. – mck Jan 11 '21 at 20:29