This is a follow-up to this post. Here is the problem statement: I have multiple CSV files inside an archive myfolder.tar.gz, which I created as follows: I first put all my files in a folder named myfolder, then created a tar archive of that folder, and then gzipped the tar archive.
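For completeness, this is roughly how the archive was built; a minimal sketch using Python's tarfile module (paths are just placeholders), equivalent to `tar -czf myfolder.tar.gz myfolder`:

```python
import tarfile

# Build myfolder.tar.gz from the myfolder directory (gzip-compressed tar).
with tarfile.open("myfolder.tar.gz", "w:gz") as tar:
    tar.add("myfolder", arcname="myfolder")
```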
Let's say we have 5 files:
abc_1.csv
abc_2.csv
abc_3.csv
def_1.csv
def_2.csv
I want to read only the files whose names match a specific pattern, using only the PySpark DataFrame API. For example, we want to read all the abc files together; the result should not include any data from the def files, and vice versa.
The solution in this post by @blackbishop uses an RDD to extract the files and then converts it to a DataFrame. It works perfectly fine, but performance suffers for huge files. Is there any way to achieve the same result using only the PySpark DataFrame reader? We have to use only DataFrames and no RDDs. Can we do the same?
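For reference, this is the kind of DataFrame-only read I have in mind; a sketch assuming the CSVs were loose files in a directory rather than inside the archive (the directory path, header option, and glob pattern here are only illustrative, and I am not sure any of this can be made to look inside the tar.gz itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the abc_* files via a glob in the path
# (assumes the CSVs are extracted into myfolder/, not still inside the tar.gz).
abc_df = spark.read.option("header", "true").csv("myfolder/abc_*.csv")

# Or, on Spark 3.0+, the same filtering with the pathGlobFilter reader option.
abc_df = (
    spark.read
    .option("header", "true")
    .option("pathGlobFilter", "abc_*.csv")
    .csv("myfolder/")
)
```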