I have multiple CSV files inside an archive myfolder.tar.gz. I created it like this: first I put all my files in a folder named myfolder, then made a tar of that folder, and then gzipped the tar.
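For reference, the archive was built roughly like this (a sketch of the steps above; the sample file contents are made up):

```shell
# Recreate the folder layout described above (sample CSV contents are placeholders).
mkdir -p myfolder
for name in abc_1 abc_2 abc_3 def_1 def_2; do
    printf 'id,value\n1,foo\n' > "myfolder/${name}.csv"
done

tar -cf myfolder.tar myfolder   # tar of the folder
gzip -f myfolder.tar            # gzip that tar -> myfolder.tar.gz
```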
Let us say we have 5 files.
abc_1.csv
abc_2.csv
abc_3.csv
def_1.csv
def_2.csv
I want to read only the files matching a specific filename pattern into a PySpark DataFrame. For example, I want to read all the abc files together; this should not include results from the def files, and vice versa. Currently, I am able to read all the CSV files together just by using the spark.read.csv() function. I am also able to filter files when they sit in a plain folder, using the pathGlobFilter option like this:
df = spark.read.csv("mypath", pathGlobFilter="def_[1-9].csv")
But when I try to do the same on the tar.gz, like:
df = spark.read.csv("myfolder.tar.gz", pathGlobFilter="def_[1-9].csv")
I get an error: Unable to infer Schema for CSV. How can I read from a .tar.gz file?