
There are multiple Parquet files, covering 15-16 years, that need to be read in PySpark. Below is an example for one such year.

yt_2009= spark.read.parquet("s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2009-*")

I am trying to write a function that reads all the Parquet files at once, to avoid duplicating this line for every year. Below is the code that I wrote:

list_year = ['yt_2021', 'yt_2020',...]
list_files = ['"s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2021-*"','"s3:/x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2020-*"',....]

def read_parquet_multiple(list_year,list_files):
    for i in range(len(list_year)):
        list_year[i]=spark.read.parquet(list_files[i])

However, I am facing the following error when I try to run the function:

pyspark.sql.utils.IllegalArgumentException: "java.net.URISyntaxException: Illegal character in scheme name at index 0: 's3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2021-*'"

Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 0: "s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2021-*%22

I am not sure what the issue is; I need some guidance and help in fixing this.
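For what it's worth, the traceback points directly at stray quotes inside the `list_files` entries: each string literal wraps the S3 URI in an *extra* pair of double quotes, so the value Spark receives starts with a literal `"` rather than `s3`. A quick check with plain Python strings (no Spark needed; the path is copied from the question):

```python
# First element of list_files, exactly as written in the question:
# single quotes delimit the Python string, so the double quotes are
# part of the value itself.
path = '"s3://x/y/nyc_taxi/yellow_taxi/yellow_tripdata_2021-*"'

first_char = path[0]      # '"' -- this is the "illegal character in
                          # scheme name at index 0" from the traceback
fixed = path.strip('"')   # 's3://...' -- now the scheme starts at index 0
```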

goonerboi
  • the elements within the `list_year` list are strings, and you can't assign a dataframe to a string -- it needs to be a variable. the next best option for what you're trying to achieve would be to use a dict, where the key acts as the dataframe name (variable) and its value is the dataframe. e.g. `df_dict = {} ; df_dict[list_year[i]] = spark.read.parquet(list_files[i])` -- [here's](https://stackoverflow.com/q/72983940/8279585) a similar approach – samkart Oct 31 '22 at 05:30
  • also, the error looks like it is a path exception -- must be due to the erroneous quotes. – samkart Oct 31 '22 at 05:34
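Putting the two comments together, a minimal sketch of the dict-based approach, with the stray quotes stripped from the paths. This assumes `spark` is an active `SparkSession`; the year labels and bucket paths are taken from the question:

```python
def read_parquet_multiple(spark, years, paths):
    """Return {year_label: DataFrame} for each (label, path) pair.

    The dict keys play the role of the per-year "dataframe names"
    (yt_2021, yt_2020, ...) the question tried to create by assigning
    into a list of strings.
    """
    dfs = {}
    for year, path in zip(years, paths):
        # strip any embedded quotes so the URI scheme starts at index 0
        dfs[year] = spark.read.parquet(path.strip('"'))
    return dfs
```

Usage would look like `df_dict = read_parquet_multiple(spark, list_year, list_files)`, after which each year's DataFrame is reachable as `df_dict['yt_2021']`, `df_dict['yt_2020']`, and so on.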

0 Answers