
I need to load a plain text RDD in Spark. However, for reasons beyond my control, the file to be loaded must be named "xxx.gz". Because of that extension, `sc.textFile` treats it as a gzip file by default. How can I tell Spark to read the file as plain text?
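
For reference, a minimal sketch of the situation described; the path is a placeholder and `sc` is obtained from an existing or new context:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()  # reuse the shell's context if one exists

    # Because of the ".gz" suffix, Spark applies the gzip codec even though
    # the file "data/xxx.gz" (placeholder path) is actually plain text.
    rdd = sc.textFile("data/xxx.gz")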

hengyue li
  • That's not a text file. `gz` is the extension for [GZip packages](https://www.gzip.org/). That GZip package may contain one or more text files – Panagiotis Kanavos Jun 24 '19 at 10:41
  • Possible duplicate of [Read from a gzip file in python](https://stackoverflow.com/questions/12902540/read-from-a-gzip-file-in-python) – bharatk Jun 24 '19 at 10:43
  • This is a Spark problem, which is mentioned in the tag. – hengyue li Jun 24 '19 at 11:34
  • Possible duplicate of [How to read gz compressed file by pyspark](https://stackoverflow.com/questions/42761912/how-to-read-gz-compressed-file-by-pyspark) – user10938362 Jun 24 '19 at 14:37

1 Answer


You can use Python's `gzip` module to open the file:

    gzip.open(filename, mode='rb', compresslevel=9, encoding=None, errors=None, newline=None)
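
As an illustration only (not part of the original answer), a minimal sketch of reading such a file in the driver and handing the lines to Spark with `parallelize`; the path is a placeholder, and the fallback to the built-in `open` is an assumption for the case where the file is plain text despite its `.gz` name:

    import gzip

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()  # reuse an existing context if available

    path = "data/xxx.gz"  # placeholder path

    # gzip.open only works if the file is genuinely gzip-compressed;
    # for a plain text file that merely carries a .gz name it raises OSError,
    # so fall back to the built-in open().
    try:
        with gzip.open(path, mode="rt", encoding="utf-8") as f:
            lines = f.read().splitlines()
    except OSError:
        with open(path, mode="r", encoding="utf-8") as f:
            lines = f.read().splitlines()

    rdd = sc.parallelize(lines)  # plain text RDD, one element per line

Note that this reads the whole file in the driver, so it only suits files that fit in driver memory.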