I have the following snippet, which iterates over a list of .csv files and uses an insert_csv_data function that reads, preprocesses and inserts each .csv file's data into a .hyper file (Hyper is Tableau's in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets). A detailed interpretation of the insert_csv_data function can be found here:
for csv in csv_list:
    insert_csv_data(csv)
The issue with the above code is that it inserts one .csv file into the .hyper file at a time, which is pretty slow at the moment.
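For context, insert_csv_data has roughly the following shape (a simplified sketch of my own function; the actual preprocessing is described in the linked answer, and the database and table names below are just placeholders):

import pandas as pd
from tableauhyperapi import (Connection, CreateMode, HyperProcess,
                             Inserter, TableName, Telemetry)

def insert_csv_data(csv):
    # Read and preprocess a single csv, then append its rows to the
    # existing "Extract"."Extract" table inside the .hyper file.
    df = pd.read_csv(csv)
    # ... preprocessing steps ...
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(endpoint=hyper.endpoint,
                        database="output.hyper",
                        create_mode=CreateMode.NONE) as connection:
            with Inserter(connection, TableName("Extract", "Extract")) as inserter:
                inserter.add_rows(df.itertuples(index=False, name=None))
                inserter.execute()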
I would like to know if there's a faster or parallel workaround, as I'm using Apache Spark for processing on Databricks. I've done some research and found modules like multiprocessing, joblib and asyncio that might work for my scenario, but I'm unsure of how to correctly implement them.
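For example, with multiprocessing I imagine something along these lines (a rough sketch on my part; it assumes insert_csv_data can safely be called from several processes at once and that concurrent writes into the same .hyper file are allowed, which I'm not sure is the case):

from multiprocessing import Pool

if __name__ == "__main__":
    # Spread the csv files across a pool of worker processes,
    # each one calling insert_csv_data on a single file.
    with Pool(processes=4) as pool:
        pool.map(insert_csv_data, csv_list)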
Please advise.
Edit:
Parallel Code:
from joblib import Parallel, delayed

# n_jobs=-1 uses all available cores; n_jobs=1 would run the inserts one after another
element_run = Parallel(n_jobs=-1)(delayed(insert_csv_data)(csv) for csv in csv_list)
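If I understand joblib correctly, each delayed(insert_csv_data)(csv) call is dispatched to a separate worker and element_run collects the return values. What I'm unsure about is whether the default process-based backend is the right choice here (as opposed to prefer="threads"), given that every worker ultimately writes into the same .hyper file.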