I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/
I pass that to binaryFiles in PySpark:
zips = sc.binaryFiles('/user/path-to-folder-with-zips/')
I'm trying to unzip the zip files and process the text files inside them, so as a first step I wanted to see what the RDD actually contains. I did that like this:
zips_collected = zips.collect()
But when I do that, I get an empty list:
>> zips_collected
[]
I know that the zips are not empty - they contain text files. The documentation here says:
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
What am I doing wrong here? I know I can't read the contents directly, because they're zipped and therefore binary, but I should at least be able to see SOMETHING. Why does it not return anything?
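Based on that description, I expected collect() to give me a list of (path, bytes) pairs, something like the following (the file names here are made up, and the values would just be the raw zip bytes, which start with the zip magic number):

>> zips_collected
[('hdfs://.../user/path-to-folder-with-zips/file1.zip', b'PK\x03\x04...'),
 ('hdfs://.../user/path-to-folder-with-zips/file2.zip', b'PK\x03\x04...')]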
There can be more than one text file per zip, but their contents always look something like this (a sketch of how I plan to parse them follows the sample):
rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
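For context, this is roughly what I plan to do with the RDD once binaryFiles actually returns something. It's just a sketch (extract_lines is a helper name I made up), opening the raw bytes of each record with zipfile and splitting the pipe-delimited lines:

import io
import zipfile

def extract_lines(kv):
    # kv is a (path, bytes) pair from binaryFiles; open the bytes as a zip
    # archive and yield every text line from every file inside it.
    path, content = kv
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            for line in zf.read(name).decode('utf-8').splitlines():
                yield line

# Split each pipe-delimited line into its fields.
rows = zips.flatMap(extract_lines).map(lambda line: line.split('|'))

But none of that matters until collect() stops coming back empty.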