
I have the following snippet, which iterates over a list of .csv files and then uses an insert_csv_data function that reads, preprocesses and inserts each .csv file's data into a .hyper file (Hyper is Tableau's in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets):

A detailed interpretation of the insert_csv_data function can be found here

for csv in csv_list:
    insert_csv_data(hyper)

The issue with the above code is that it inserts the .csv files into the .hyper file one at a time, which is quite slow at the moment.

I would like to know if there's a faster or parallel workaround as I'm using Apache Spark for processing on Databricks. I've done some research and found modules like multiprocessing, joblib and asyncio that might work for my scenario, but I'm unsure of how to correctly implement them.
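For context, the shape such a parallel workaround would take with the standard library's concurrent.futures is sketched below. Since insert_csv_data isn't shown in the question, the body here is a hypothetical stand-in that just returns a message; csv_list is likewise a made-up list of paths:

```python
from concurrent.futures import ThreadPoolExecutor


def insert_csv_data(csv_path):
    # Placeholder: the real function would read, preprocess and
    # insert the .csv file's data into the .hyper file.
    return f"inserted {csv_path}"


csv_list = ["a.csv", "b.csv", "c.csv"]

# executor.map applies the function to every path, running up to
# max_workers calls concurrently, and yields results in input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(insert_csv_data, csv_list))
```

Whether threads actually help depends on whether the real insert is I/O-bound and on whether the target (here, Hyper) tolerates concurrent connections.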

Please advise.

Edit:

Parallel Code:

from joblib import Parallel, delayed
element_run = Parallel(n_jobs=1)(delayed(insert_csv_data)(csv) for csv in csv_list)
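Note that n_jobs=1 in the line above makes joblib run the tasks serially. A minimal sketch of a genuinely parallel variant is below; since the real insert_csv_data isn't shown, the one here is a hypothetical stand-in that just returns the path length:

```python
from joblib import Parallel, delayed


def insert_csv_data(csv_path):
    # Placeholder standing in for the real Hyper insert.
    return len(csv_path)


csv_list = ["a.csv", "bb.csv"]

# n_jobs=2 runs two tasks at once; prefer="threads" asks joblib for a
# thread-based backend, which suits I/O-bound work and avoids pickling
# the function for worker processes.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(insert_csv_data)(csv) for csv in csv_list
)
```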

1 Answer


This does not directly answer the question, but it demonstrates how multiprocessing and multithreading are easily interchangeable using the concurrent.futures module. Note that the two loops achieve exactly the same thing and that the only difference between the two sections of code is the worker pool class.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def tfunc(n):
    return n * n


N = 1_000


def main():
    # Thread-based pool: all workers run in this process.
    with ThreadPoolExecutor() as executor:
        for future in [executor.submit(tfunc, n) for n in range(N)]:
            future.result()

    # Process-based pool: identical code, but workers are separate processes.
    with ProcessPoolExecutor() as executor:
        for future in [executor.submit(tfunc, n) for n in range(N)]:
            future.result()


if __name__ == '__main__':
    main()
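As an aside, the submit-then-result pattern above can also be written with executor.map, which handles submission and ordered collection in one call. A minimal sketch using the same tfunc:

```python
from concurrent.futures import ThreadPoolExecutor


def tfunc(n):
    return n * n


# executor.map yields results in the order of the inputs.
with ThreadPoolExecutor() as executor:
    results = list(executor.map(tfunc, range(5)))
```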
  • No because the two lines of code you've presented make no sense. What is *hyper*? You don't seem to be using *csv*. What does *insert_csv_data* do? Crucial questions that need to be answered –  Oct 11 '21 at 07:17
  • In that case I suggest you try both techniques and measure the outcome. –  Oct 11 '21 at 07:26
  • Thank you. Here's a more appropriate [reference](https://github.com/tableau/hyper-api-samples/blob/main/Tableau-Supported/Python/create_hyper_file_from_csv.py) if you have the time. – The Singularity Oct 11 '21 at 07:27
  • Given that Hyper seems to be an in-memory technology, I would start with multiprocessing –  Oct 11 '21 at 07:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/238012/discussion-between-brutusforcus-and-luke). –  Oct 11 '21 at 07:31
  • I strongly recommend that you read https://joblib.readthedocs.io/en/latest/parallel.html and also just go ahead and adapt my code for your purposes –  Oct 11 '21 at 07:38
  • `ProcessPoolExecutor` took longer than the usual time, and `ThreadPoolExecutor` didn't work as `Hyper` doesn't support making multiple connections at the same time. Thanks a lot for the assistance, I'll take up the issue with Tableau – The Singularity Oct 11 '21 at 09:09