Why are my `binaryFiles` empty when I collect them in pyspark?

Question

I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/.

I pass that path to `binaryFiles` in pyspark:

zips = sc.binaryFiles('/user/path-to-folder-with-zips/')

I'm trying to unzip the zip files and do things to the text files in them, so I tried to just see what the content will be when I try to deal with the RDD. I did it like this:

zips_collected = zips.collect()

But, when I do that, it gives an empty list:

>>> zips_collected
[]

I know that the zips are not empty - they contain text files. The documentation for `binaryFiles` says:

Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

What am I doing wrong here? I know I can't view the contents of the file because it is zipped and therefore binary. But, I should at least be able to see SOMETHING. Why does it not return anything?

There can be more than one file per zip file, but the contents are always something like this:

rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data

It occurred to me that perhaps this is an issue with the *type* of zip file. How do I find out what type of zip file this is? — makansij, Jul 09 '16 at 00:59
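
One way to check which container format a file uses is to look at its first few bytes. The following is a minimal sketch, not a definitive test: `sniff_container` is a hypothetical helper name, and it only distinguishes the common ZIP signatures (`PK...`) from the gzip signature (`\x1f\x8b`).

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"
# Local file header, empty-archive end record, and spanned-archive marker.
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def sniff_container(head):
    """Guess the container type from the leading bytes of a file."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZIP_MAGICS):
        return "zip"
    return "unknown"


# Small in-memory samples (artificial data) to exercise the helper.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w") as zf:
    zf.writestr("sample.psv", "1|a|b\n")

zip_kind = sniff_container(zip_buf.getvalue()[:4])
gzip_kind = sniff_container(gzip.compress(b"1|a|b\n")[:4])
```

For a local copy of one of the files, the same helper could be applied to the result of `open(path, "rb").read(4)`.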

score 1 · Answer 1 · edited May 23 '17 at 12:08

I'm assuming that each zip file contains a single text file (the code is easily changed to handle multiple text files). You need to read the contents of the zip file via io.BytesIO first, before processing it line by line. The solution is loosely based on https://stackoverflow.com/a/36511190/234233.

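import io
import gzip

def zip_extract(x):
    """Extract *.gz file in memory for Spark"""
    # x is a (path, content) pair from binaryFiles; x[1] holds the raw bytes.
    file_obj = gzip.GzipFile(fileobj=io.BytesIO(x[1]), mode="r")
    # Decode so the split below works on text (assumes UTF-8 content).
    return file_obj.read().decode("utf-8")

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
# parse_line is a user-supplied function that is not defined in this answer.
results = (zip_data.map(zip_extract)
                   .flatMap(lambda zip_file: zip_file.split("\n"))
                   .map(lambda line: parse_line(line))
                   .collect())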
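
Because the question concerns ZIP archives rather than gzip streams, a variant built on the standard-library `zipfile` module may be closer to what is needed. This is a minimal sketch under stated assumptions: `zip_extract_lines` is a hypothetical name, every archive member is assumed to be UTF-8 text, and each archive is assumed to fit in memory on one executor.

```python
import io
import zipfile


def zip_extract_lines(record):
    """Return the text lines of every member in one (path, bytes) record."""
    _path, content = record
    lines = []
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            # Assumption: members are UTF-8 encoded text files.
            text = archive.read(name).decode("utf-8")
            lines.extend(text.splitlines())
    return lines


# Local check with artificial data; no Spark context is needed here.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.psv", "1|x|y\n2|p|q\n")
    zf.writestr("b.psv", "3|m|n\n")

sample_lines = zip_extract_lines(("/tmp/sample.zip", buf.getvalue()))
```

With a Spark context available, the function could be passed to `flatMap`, for example `sc.binaryFiles('/user/path-to-folder-with-zips/*.zip').flatMap(zip_extract_lines)`, so that each line becomes one record.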

answered Jul 07 '16 at 23:54

ramhiser

My code is exactly like that, almost verbatim. But it throws an error. That's why I thought I would just run `collect`, to see whether there is anything in the RDD at all. But it returns an empty list. This is how I started debugging. – makansij Jul 08 '16 at 17:03
How many text files are in each zip file -- 1 or more? Also, could you post a small sample of zip files? If the data are proprietary, that's fine -- use artificial data instead. – ramhiser Jul 08 '16 at 18:55
Sure, it's just a bunch of psv files. Each file is on the order of tens of MB. – makansij Jul 08 '16 at 22:14
Also, your example is for `gzip` files. I'm using zip files. – makansij Jul 09 '16 at 01:04
