There are multiple parquet files, covering roughly 15-16 years of data, that need to be read in PySpark. Below is an example for one such year:
yt_2009= spark.read.parquet("s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2009-*")
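The reads for the other years look essentially identical; only the year in the variable name and in the path changes (illustrative sketch, assuming the same bucket and prefix):

yt_2010 = spark.read.parquet("s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2010-*")
yt_2011 = spark.read.parquet("s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2011-*")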
I am trying to write a function that reads all the parquet files at once, to remove this duplication. Below is the code that I wrote:
list_year = ['yt_2021', 'yt_2020',...]
list_files = ['"s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2021-*"','"s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2020-*"',....]
def read_parquet_multiple(list_year, list_files):
    for i in range(len(list_year)):
        list_year[i] = spark.read.parquet(list_files[i])
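For reference, this is how I am calling it (a minimal sketch of the intended usage; the idea is that each list entry ends up holding the DataFrame for its year):

read_parquet_multiple(list_year, list_files)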
However, I am facing the following error when I try to run the function:
pyspark.sql.utils.IllegalArgumentException: "java.net.URISyntaxException: Illegal character in scheme name at index 0: 's3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2021-*'"
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 0: "s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2021-*%22
I am not sure what the issue is. Any guidance on fixing this would be appreciated.