
I have a set of gzip-compressed CSV files in S3, but they have a .csv extension rather than .csv.gz. Because of that, when I try to read them with PySpark they do not read properly. I have tried many configurations, with no luck.


Then I found a similar issue here (link), but it is solved in Scala. I tried to implement the same thing in Python, but I could not find the right APIs for it.


Any help would be appreciated.

In short: how do I implement Python code to read a compressed file with a custom extension using PySpark?
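One pure-Python workaround (a sketch, not the custom-codec approach from the Scala answer) is to bypass extension-based codec detection entirely: read each object's raw bytes with `SparkContext.binaryFiles`, gunzip them with the standard library, and hand the resulting lines to `spark.read.csv`, which in PySpark also accepts an RDD of strings. The bucket path below is a placeholder; the demo uses an in-memory gzip blob standing in for one S3 object's bytes.

```python
import gzip
import io


def gunzip_csv_bytes(raw: bytes) -> list:
    """Decompress one gzip payload (a whole S3 object) into text lines."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Demo on an in-memory blob, standing in for the bytes of one mislabeled
# ".csv" file that is really gzip-compressed:
sample = gzip.compress(b"id,name\n1,alice\n2,bob\n")
print(gunzip_csv_bytes(sample))  # ['id,name', '1,alice', '2,bob']

# In a Spark job (sketch; "s3a://my-bucket/..." is a placeholder path):
#   rdd = (sc.binaryFiles("s3a://my-bucket/data/*.csv")
#            .flatMap(lambda kv: gunzip_csv_bytes(kv[1])))
#   df = spark.read.csv(rdd, header=True)
```

Note that `binaryFiles` loads each file whole into memory on one executor, so this suits many small-to-medium files rather than a few very large ones; for the latter, the custom Hadoop codec route from the linked Scala answer is still the cleaner fix.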

Shanga
  • Does this answer your question? [How to read gz compressed file by pyspark](https://stackoverflow.com/questions/42761912/how-to-read-gz-compressed-file-by-pyspark) – Pravash Panigrahi Apr 27 '23 at 17:25
  • @PravashPanigrahi No, that's not the problem here. In that question the file extension is correct (.gz), but in my case it is incorrect (.csv instead of .csv.gz), so I need to develop a custom compression codec. As I showed in the question, this is possible in Java; now I need to implement it in Python, but I can't find the right API for that. Anyway, I appreciate your help. :) – Shanga Apr 29 '23 at 16:21
  • Maybe I misunderstood, but I think it can be done with: `codec = "org.apache.hadoop.io.compress.GzipCodec"`. – Memristor May 04 '23 at 22:08

0 Answers