I need to load a plain-text RDD in Spark, but for some reason the file to be loaded must be named "xxx.gz". Because of the extension, `sc.textFile` treats this file as a gzip file by default. How can I tell Spark to read the file as plain text?
Viewed 354 times
-8
- That's not a text file. `gz` is the extension for [gzip archives](https://www.gzip.org/). That gzip archive may contain one or more text files. – Panagiotis Kanavos Jun 24 '19 at 10:41
- Possible duplicate of [Read from a gzip file in python](https://stackoverflow.com/questions/12902540/read-from-a-gzip-file-in-python) – bharatk Jun 24 '19 at 10:43
- This is a Spark problem, as mentioned in the tags. – hengyue li Jun 24 '19 at 11:34
- Possible duplicate of [How to read gz compressed file by pyspark](https://stackoverflow.com/questions/42761912/how-to-read-gz-compressed-file-by-pyspark) – user10938362 Jun 24 '19 at 14:37
1 Answer
0
You can use Python's `gzip` module, which opens gzip files locally (outside of Spark):

`gzip.open(filename, mode='rb', compresslevel=9, encoding=None, errors=None, newline=None)`

Note that `gzip.open` actually decompresses gzip data; if the file is plain text that is merely *named* `.gz`, it will fail to read it.
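To stay inside Spark, one workaround (a sketch, assuming PySpark, an existing `SparkContext` named `sc`, and a hypothetical path `data/xxx.gz`) is to bypass `sc.textFile`'s extension-based codec detection with `sc.binaryFiles`, which returns each file's raw bytes with no decompression applied, and decode the lines yourself:

```python
# Sketch: the path "data/xxx.gz" and the SparkContext `sc` are assumptions,
# not from the original post. sc.binaryFiles yields (path, bytes) pairs
# without applying any compression codec, so a plain-text file that merely
# ends in .gz is not gunzipped.

def bytes_to_lines(raw, encoding="utf-8"):
    """Decode a file's raw bytes and split them into lines."""
    return raw.decode(encoding).splitlines()

# With an existing SparkContext `sc`:
# rdd = (sc.binaryFiles("data/xxx.gz")   # raw bytes, codec detection bypassed
#          .flatMap(lambda kv: bytes_to_lines(kv[1])))
```

The trade-off is that `binaryFiles` reads each file as a single unsplittable blob, so this suits many small files better than one very large one.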

GodlyBuTcheR