
This is a continuation of this post. Here is the problem statement: I have multiple CSV files inside an archive, myfolder.tar.gz, which I created as follows: first I put all my files in a folder named myfolder, then made a tar archive of that folder, and then gzipped that tar archive (a minimal sketch of this packaging step follows the file list below).

Let us say we have 5 files:

abc_1.csv
abc_2.csv
abc_3.csv
def_1.csv
def_2.csv
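
For concreteness, a minimal sketch of the packaging step described above, using Python's `tarfile` module; the local folder layout is an assumption:

```python
import tarfile

# "w:gz" writes a gzip-compressed tar archive in one go, equivalent to
# tarring the folder first and then gzipping the resulting tar file.
with tarfile.open("myfolder.tar.gz", "w:gz") as tar:
    tar.add("myfolder")  # adds the folder and all five CSVs recursively
```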

I want to read only the files matching a specific filename pattern, using only the PySpark DataFrame reader. For example, we want to read all the abc files together; this should not include results from the def files, and vice versa.
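
For reference, on plain, uncompressed files the DataFrame reader can already filter by filename; a minimal sketch (whether the files have headers is unknown, so reader options are omitted):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A glob in the path reads only the abc files:
abc_df = spark.read.csv("myfolder/abc_*.csv")

# Equivalently, since Spark 3.0, the pathGlobFilter file-source option
# filters filenames within a directory:
abc_df = spark.read.option("pathGlobFilter", "abc_*.csv").csv("myfolder/")
```

The difficulty here is that the files are not individually visible to the reader: they sit inside a single tar.gz.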

The solution in this post by @blackbishop uses an RDD to extract the files and then converts it to a DataFrame. This works perfectly fine, but performance suffers for huge files. Is there any way to do the same thing using only the PySpark DataFrame reader? We must use only DataFrames, no RDDs. Can we do the same?
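
For context, a rough, hedged reconstruction of that RDD-based approach (the exact code is in the linked post; the helper name and the line-based parsing here are assumptions):

```python
import io
import tarfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def extract_matching(kv, prefix="abc"):
    """Yield (filename, line) pairs for tar members whose basename starts with prefix."""
    path, data = kv
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.split("/")[-1].startswith(prefix):
                content = tar.extractfile(member).read().decode("utf-8")
                for line in content.splitlines():
                    yield (member.name, line)

# binaryFiles yields one (path, bytes) record per archive, so the whole
# tar.gz is decompressed in memory by a single task.
df = sc.binaryFiles("myfolder.tar.gz").flatMap(extract_matching).toDF(["filename", "line"])
```

That single-record decompression is presumably why performance degrades for huge archives: the archive cannot be split across executors.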

• Just out of curiosity, why do you create `myfolder.tar.gz` with multiple CSVs inside if you wanted to read them separately later? I'm afraid you'll have to uncompress them if you want to use the DataFrame API, or change the process of generating them to have one gzip per single CSV file (see the sketch after these comments), or at least regroup them into categories that can be loaded together without filtering. – blackbishop Feb 14 '21 at 12:44
• Hi blackbishop, you are right, but that is the requirement. The benefit of compressing them inside a tar.gz is that it saves a lot of space. Since the requirement is to read them with filtering, it would be optimal to save space as well as time. By compressing them we save space; hence we now want to read the archive optimally using a DataFrame. – supernova Feb 14 '21 at 13:03
• The problem with the DataFrame API is that it only knows the `tar.gz` filename, so you can't filter on it. Are there any columns in the CSV files that could be used in place of filtering by filename? – blackbishop Feb 14 '21 at 13:09
• Yes, I also think the DataFrame will treat it as a single file, so filtering is not possible using DataFrames alone. No, I don't want any specific columns; the requirement was just to filter by specific file names. But yes, I think we can't do it, since we would need to provide a path to a file that does not exist: the tar.gz acts as a single source. Thanks for the answer. – supernova Feb 14 '21 at 13:29
• Same opinion as blackbishop: either you need RDDs, or you need to separate the files into 2 different archives. You cannot use the DataFrame API for that job. – Steven Feb 15 '21 at 09:09
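
A minimal sketch of the repackaging alternative suggested above (one gzip per CSV instead of a single tar.gz); the folder layout is an assumption:

```python
import gzip
import shutil

from pyspark.sql import SparkSession

# Compress each CSV individually so every file keeps its own name.
for name in ["abc_1.csv", "abc_2.csv", "abc_3.csv", "def_1.csv", "def_2.csv"]:
    with open(f"myfolder/{name}", "rb") as src:
        with gzip.open(f"myfolder/{name}.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)

spark = SparkSession.builder.getOrCreate()

# Spark's CSV reader decompresses .gz transparently, and the glob now
# selects the abc files before anything is read:
abc_df = spark.read.csv("myfolder/abc_*.csv.gz")
```

The trade-off is typically a somewhat worse compression ratio than one big tar.gz, in exchange for filename-level filtering with the DataFrame reader alone.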

0 Answers