
I have a pandas dataframe of about 2 million rows and 80 columns.

I would like to write the dataframe out to a CSV file as well as a Parquet file.

Assume the dataframe is present in the `df` variable.

Initial approach:

import time

print('Creating csv and parquet files')
st = time.time()
df.to_csv('output_file.csv')
df.to_parquet('output_file.parquet')
print(f'Created csv and parquet files in {time.time() - st} seconds')

Writing the files with this approach takes too long. Since these are two independent operations, I assumed I could take advantage of multiple processes.

Newer approach:

from concurrent.futures import ProcessPoolExecutor


def build_csv(dataframe, output_filename):
    print(f'Building csv: {output_filename}')
    dataframe.to_csv(output_filename)


def build_parquet(dataframe, output_filename):
    print(f'Building parquet: {output_filename}')
    dataframe.to_parquet(output_filename)


with ProcessPoolExecutor(max_workers=3) as executor:
    executor.submit(build_csv, (df, 'output_file.csv'))
    executor.submit(build_parquet, (df, 'output_file.parquet'))

The newer approach runs without raising any errors, but I do not see any files being created, and I am not sure why this happens.

Is there a better (quicker) approach for writing a pandas dataframe to different files?

yashtodi94

2 Answers


EDIT: I've kept the threading solution below for your reference. The multiprocessing solution here, however, sidesteps the Python GIL by running the two exports in separate processes. I've tested it and can see that the files are written successfully:

from multiprocessing import Pool
import pandas as pd

# original data:
data = pd.DataFrame([
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,], 
    [ 1, 2, 3, 4,],
])    


def SaveDataToCsv(data):
    print('Started export to .csv')
    data.to_csv('data.csv')
    print('Finished export to .csv')


def SaveDataToParquet(data):
    print('Started export to .parquet')
    data.to_parquet('data.parquet')
    print('Finished export to .parquet')


# multiprocessing method:
pool = Pool(processes=2)
process1 = pool.apply_async(SaveDataToCsv, [data])
process2 = pool.apply_async(SaveDataToParquet, [data])

# wait for both exports to finish; without this the script can exit
# and terminate the worker processes before the files are written
pool.close()
pool.join()

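As an aside, the likely reason the question's ProcessPoolExecutor version produced no files is that each submit call was handed a single tuple instead of separate arguments, so the workers raised a TypeError that was never surfaced because the returned futures were never inspected. A minimal sketch of the corrected calls, reusing build_csv and build_parquet from the question:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=2) as executor:
    # pass the dataframe and filename positionally, not as one tuple
    future_csv = executor.submit(build_csv, df, 'output_file.csv')
    future_parquet = executor.submit(build_parquet, df, 'output_file.parquet')

    # result() re-raises any exception from the worker,
    # so failures are no longer silent
    future_csv.result()
    future_parquet.result()
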
Tested the threading library and it seems to work fine:

import pandas as pd
import threading

# original data:
data = pd.DataFrame([
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
    [ 1, 2, 3, 4,],
])


def SaveDataToCsv(data):        
    data.to_csv('data.csv')


def SaveDataToParquet(data):
    data.to_parquet('data.parquet')    

thread1 = threading.Thread(target=SaveDataToCsv, args=(data,))
thread2 = threading.Thread(target=SaveDataToParquet, args=(data,))

thread1.start()
thread2.start()
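
If you want to time the export itself (as in the question's original snippet), join both threads before reading the clock; otherwise you only measure how long it takes to launch them. A minimal sketch, reusing data, SaveDataToCsv and SaveDataToParquet from above:

import time

st = time.time()

thread1 = threading.Thread(target=SaveDataToCsv, args=(data,))
thread2 = threading.Thread(target=SaveDataToParquet, args=(data,))

thread1.start()
thread2.start()

# block until both exports have finished before measuring elapsed time
thread1.join()
thread2.join()

print(f'Created csv and parquet files in {time.time() - st} seconds')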
Daniel
  • Thanks. Worked like a charm, reduced quite a bit of time. But how did Python's GIL not pose a problem here? Afaik, threading in this case should be as good as running sequentially, with the additional overhead of switching threads, right? – yashtodi94 Feb 03 '20 at 04:54
  • After doing some reading/testing it seems like my answer should not reduce the total time (in fact, it may increase it). To answer your question about the Python GIL, check out this useful link: https://realpython.com/python-gil/ - Otherwise I'll try to investigate your original error to see if we can't get it to work. Based on the attached link, the multiprocessing library is the workaround for the Python GIL. – Daniel Feb 03 '20 at 05:03
  • Tested a few times on a small subset of the data (100k rows): Sequential: `Created csv and parquet files in 15.59054684638977 seconds` Your approach (Threading): `Created csv and parquet files in 0.006857156753540039 seconds` I got my solution but now I am much more confused as to how things are working. [Here](https://stackoverflow.com/questions/902425/does-multithreading-make-sense-for-io-bound-operations) is a discussion thread regarding this: – yashtodi94 Feb 03 '20 at 05:16
  • You need to time the results from the threading method after the threads have finished executing (when `thread.isAlive() == False`). Otherwise you are just measuring the time it takes to launch the threads. More on threads: https://docs.python.org/3/library/threading.html – Daniel Feb 03 '20 at 05:29
  • Profiled the approaches. Threading increases time. Also, the edited approach to use `Pool` does not create files. What am I doing wrong here? – yashtodi94 Feb 03 '20 at 06:03
  • Were you able to run the script above? I see the files being dumped in the same directory as the script as expected. – Daniel Feb 03 '20 at 07:03
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/207086/discussion-between-yashtodi94-and-daniel-sokol). – yashtodi94 Feb 03 '20 at 08:08

Since you write to the same disk, the disk is the bottleneck and your multiprocessing will not speed up the operation.
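
A rough way to check this claim is to separate serialization cost from disk I/O by writing to in-memory buffers first (a sketch, assuming df is the frame in question):

import io
import time

st = time.time()
csv_buf = io.StringIO()
df.to_csv(csv_buf)            # CSV serialization only, no disk involved
print(f'CSV serialization: {time.time() - st:.2f}s')

st = time.time()
df.to_parquet(io.BytesIO())   # Parquet serialization only
print(f'Parquet serialization: {time.time() - st:.2f}s')

st = time.time()
with open('output_file.csv', 'w') as f:
    f.write(csv_buf.getvalue())   # pure disk write of the already-serialized CSV
print(f'CSV disk write: {time.time() - st:.2f}s')

If serialization dominates, running the two exports in separate processes can still help; if the raw disk write dominates, this answer's point stands.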

Dewald Abrie
  • I disagree; for instance, we might be using zip compression when saving, and intuition says this would tie up the process unless we pass it to a new thread / process. – Anonymous Aug 28 '22 at 06:57