I need to load a plain-text RDD in Spark, but for some reason the file to be loaded must be named "xxx.gz". Because of the extension, `sc.textFile` treats this file as a gzip file by default. How can I tell Spark to read the file as plain text?
Viewed 354 times
-8
- That's not a text file. `gz` is the extension for [gzip archives](https://www.gzip.org/). That gzip archive may contain one or more text files. – Panagiotis Kanavos Jun 24 '19 at 10:41
- Possible duplicate of [Read from a gzip file in python](https://stackoverflow.com/questions/12902540/read-from-a-gzip-file-in-python) – bharatk Jun 24 '19 at 10:43
- This is a Spark problem, as mentioned in the tags. – hengyue li Jun 24 '19 at 11:34
- Possible duplicate of [How to read gz compressed file by pyspark](https://stackoverflow.com/questions/42761912/how-to-read-gz-compressed-file-by-pyspark) – user10938362 Jun 24 '19 at 14:37
1 Answer
0
You can use Python's `gzip` module, which opens gzip files locally (outside of Spark):

`gzip.open(filename, mode='rb', compresslevel=9, encoding=None, errors=None, newline=None)`

Note that `gzip.open` actually decompresses gzip data; if the file is plain text that is merely *named* `.gz`, it will fail to read it.
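To stay inside Spark, one workaround (a sketch, assuming PySpark, an existing `SparkContext` named `sc`, and a hypothetical path `data/xxx.gz`) is to bypass `sc.textFile`'s extension-based codec detection with `sc.binaryFiles`, which returns each file's raw bytes with no decompression applied, and decode the lines yourself:

```python
# Sketch: the path "data/xxx.gz" and the SparkContext `sc` are assumptions,
# not from the original post. sc.binaryFiles yields (path, bytes) pairs
# without applying any compression codec, so a plain-text file that merely
# ends in .gz is not gunzipped.

def bytes_to_lines(raw, encoding="utf-8"):
    """Decode a file's raw bytes and split them into lines."""
    return raw.decode(encoding).splitlines()

# With an existing SparkContext `sc`:
# rdd = (sc.binaryFiles("data/xxx.gz")   # raw bytes, codec detection bypassed
#          .flatMap(lambda kv: bytes_to_lines(kv[1])))
```

The trade-off is that `binaryFiles` reads each file as a single unsplittable blob, so this suits many small files better than one very large one.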

GodlyBuTcheR