
I have multiple part folders, each containing parquet files (example given below). Across the part folders the schema can differ (either in the number of columns or in the datatype of a particular column). My requirement is to read all the part folders and finally create a single DataFrame that conforms to a predefined schema that is passed in.

/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231

Since I am not sure what type of changes exist in which part folders, I am reading each part folder individually, comparing its schema with the predefined schema, and making the necessary changes, i.e., adding/dropping columns or casting column datatypes. Once done, I write the result to a temp location and then move on to the next part folder, repeating the same operation. Once all the part folders have been processed, I read the temp location in one go to get the final output.
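
For reference, a minimal sketch of the per-folder step described above (not the asker's actual code): conform one part folder to a target schema by adding missing columns as nulls, casting existing columns, dropping extras, and appending the result to a temp location. The schema, paths, and helper names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical predefined schema passed in by the caller.
target_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("amount", DoubleType()),
])

def conform_to_schema(df, schema):
    """Add/drop/cast columns so df matches the target schema."""
    for field in schema.fields:
        if field.name not in df.columns:
            # Missing column: add it as nulls of the expected type.
            df = df.withColumn(field.name, F.lit(None).cast(field.dataType))
        else:
            # Present column: cast it to the expected type.
            df = df.withColumn(field.name, F.col(field.name).cast(field.dataType))
    # Drop any extra columns and fix the column order.
    return df.select([f.name for f in schema.fields])

def process_partition(path, temp_location):
    df = spark.read.parquet(path)
    conform_to_schema(df, target_schema) \
        .write.mode("append").parquet(temp_location)

process_partition("/feed=abc/date=20221220", "/tmp/feed_abc_conformed")
```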

Now I want to do this operation in parallel, i.e., have parallel threads/processes (?) that read the part folders in parallel, execute the schema-comparison logic plus any changes necessary, and write into a temp location. Is this possible?

I searched here for parallel processing of multiple directories, but in the majority of those scenarios the schema is the same across directories, so they simply use a wildcard in the input path to create the DataFrame; that will not work in my case. The problem statement in the question below is similar to mine, but in my case the number of part folders to read is not fixed and is sometimes over 1000. Moreover, there are operations involved in comparing and fixing the column types as well. Any help will be appreciated.

Reading multiple directories into multiple spark dataframes

1 Answer


Divide your existing ETL into two phases. The first one transforms the existing data into the appropriate schema, and the second one reads the transformed data in a convenient way (with * wildcards). Use Airflow (or Oozie) to start one data-transformer application per directory, and after all instances of the data transformer have finished successfully, run the union app.

Gleb Yan
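
A hedged sketch of the two-phase layout this answer suggests, assuming Airflow 2.x with BashOperator; the conform_partition.py and union_app.py scripts, folder list, and paths are hypothetical placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

PART_FOLDERS = ["date=20221220", "date=20221221", "date=20221231"]  # example only

with DAG(dag_id="feed_abc_conform", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    # Phase 1: one transformer run per part folder, writing to a common temp location.
    transform_tasks = [
        BashOperator(
            task_id=f"conform_{folder.replace('=', '_')}",
            bash_command=(
                "spark-submit conform_partition.py "
                f"--input /feed=abc/{folder} --output /tmp/feed_abc_conformed"
            ),
        )
        for folder in PART_FOLDERS
    ]

    # Phase 2: once every transformer succeeds, read the temp location in one go.
    union_task = BashOperator(
        task_id="union_read",
        bash_command="spark-submit union_app.py --input '/tmp/feed_abc_conformed/*'",
    )

    transform_tasks >> union_task
```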
  • We don't have Airflow/Oozie in our shop and I am looking to do it in a single piece of code. Can a thread pool be used to solve this? As far as my idea goes, when I create a thread pool the main driver spawns multiple threads, and each thread has access to the SparkSession and executes the code. In my case, the ideal scenario would be for the reading and processing of each part folder, as well as the write into a temp location, to be done by an individual thread. Once all the threads have completed, the rest of the code just reads that temp location. Is this approach even possible? – Kaushik Ghosh Jan 25 '23 at 05:19
  • Yep, you could run a bundle of threads on the driver, and manually distribute the file directories between threads. You could obtain a spark session in each thread or share existing sessions (be careful, I don't actually know if a spark session is thread-safe or not). – Gleb Yan Jan 26 '23 at 08:48
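
Following up on the thread-pool idea from these comments, a minimal sketch of driver-side threads sharing one SparkSession: each thread conforms its part folder and appends to a temp location, and a single read at the end produces the final DataFrame. It reuses the hypothetical conform_to_schema and target_schema from the earlier sketch; the paths and pool size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
TEMP_LOCATION = "/tmp/feed_abc_conformed"

def process_partition(path):
    # conform_to_schema / target_schema are the hypothetical helpers sketched above.
    df = spark.read.parquet(path)
    conform_to_schema(df, target_schema) \
        .write.mode("append").parquet(TEMP_LOCATION)
    return path

part_folders = [f"/feed=abc/date=202212{d:02d}" for d in range(20, 32)]  # example list

# Each Spark action blocks its thread, so a modest pool lets several jobs be
# submitted concurrently; the cluster still shares executors between them.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(process_partition, p) for p in part_folders]
    for fut in as_completed(futures):
        print("finished", fut.result())

# Once all threads have finished, one read over the temp location gives the final df.
final_df = spark.read.parquet(TEMP_LOCATION)
```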