
I have two large CSV files that I gather from an API. 99.9% of the time, the files have the same number of rows and the same columns and data, except for two or three columns that differ between the files. I'm performing an outer merge on the files based on 4 columns. However, the merge takes a long time: ~8 minutes for two files of 2.7 GB each, and ~12 minutes for 4 GB files. How can I speed up the merge?

I use Python 3.6.9 and dask 2021.3.0 on a server with 50 GB RAM and 24 cores. I tried setting indexes and merging on the indexes, but got no improvement in merge time. I cannot use Apache Parquet either: I receive CSV files, and I need to export the data to a single CSV file as well.
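For reference, the workflow looks roughly like this (a minimal sketch; the file names are placeholders and the key columns are the ones mentioned in the comments below):

import dask.dataframe as dd

# Read both large CSV files lazily with dask (placeholder file names)
dframe1 = dd.read_csv('file_a.csv')
dframe2 = dd.read_csv('file_b.csv')

# Outer merge on the 4 key columns (column names taken from the comments)
merged = dd.merge(dframe1, dframe2, how='outer',
                  on=['ts_ns', 'src_ip', 'src_port', 'msg type'])

# Export the result back to a single CSV file
merged.to_csv('merged.csv', single_file=True, index=False)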

  • Could you specify the key(s) you merge on, please? – DataSciRookie Sep 09 '22 at 13:27
  • @DataSciRookie I do the merge like this: `dd.merge(dframe1, dframe2, how='outer', on=['ts_ns', 'src_ip', 'src_port', 'msg type'])` – Expired Sep 09 '22 at 13:58
  • a merge on an out-of-core dataframe is just a large operation. you're explicitly making a tradeoff of time vs. memory bottlenecks by using dask.dataframe. if you have that much memory on your machine, you might be able to do this more quickly just using pandas. but if this is working... why is it a problem if it takes 12 minutes? are you running into a problem, or is it just the time? – Michael Delgado Sep 09 '22 at 17:09
  • you might be able to get a speedup by parsing the data using dask and then doing the merge in pandas, like `df = pd.merge(dd.read_csv(...).compute(), dd.read_csv(...).compute())` (see the sketch after these comments) – Michael Delgado Sep 09 '22 at 17:11
  • @MichaelDelgado, using pandas for the merge did indeed speed up the process. It now takes ~5 min 30 s for a 2.7 GB file and ~8 min 30 s for files of 4 GB, an increase of ~30% in speed at the expense of memory usage. Do you have any other tips for improving the speed? Do you know of other data frameworks that are better at handling huge CSV files? – Expired Sep 12 '22 at 10:44
  • Just these: https://stackoverflow.com/a/69153327/3888719. But if you can use any binary format such as parquet that would certainly be preferable. – Michael Delgado Sep 12 '22 at 14:16
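A minimal sketch of the dask-parse / pandas-merge combination suggested above (the file names are placeholders and the key columns are taken from the earlier comments):

import dask.dataframe as dd
import pandas as pd

# Parse the CSVs in parallel with dask, then materialize them as pandas frames
df1 = dd.read_csv('file_a.csv').compute()
df2 = dd.read_csv('file_b.csv').compute()

# Do the merge itself in pandas, entirely in memory
merged = pd.merge(df1, df2, how='outer',
                  on=['ts_ns', 'src_ip', 'src_port', 'msg type'])

merged.to_csv('merged.csv', index=False)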

1 Answer


OK, you can try it like this. The idea is simply to create a unique id from the four key columns and then select the rows that do not match, which gives the same result as the non-matching rows of an outer merge:

import pandas as pd
classes = [('Carbon', '16.7', '1','9'),
         ('Pyruvate', '30', '7','8'),
         ('Lipid', '40.5', '9','10'),
         ('Galactose', '57', '10','12'),
         ('Fatty', '22', '4','14')]
labels = ['A','B', 'C','D']
df_1 = pd.DataFrame.from_records(classes, columns=labels)

classes_ = [('Carbon', '16.78', '1','9'),
         ('Pyruvate', '30', '7','8'),
         ('Lipid', '40.5', '9','10')]
df_2 = pd.DataFrame.from_records(classes_, columns=labels)

# Create a single id from the four key columns
df_1['id_df_1'] = df_1['A'] + '_' + df_1['B'] + '_' + df_1['C'] + '_' + df_1['D']
df_2['id_df_2'] = df_2['A'] + '_' + df_2['B'] + '_' + df_2['C'] + '_' + df_2['D']

# Rows of each frame whose id does not appear in the other frame
df_2.loc[~df_2['id_df_2'].isin(df_1['id_df_1'].values), :]
df_1.loc[~df_1['id_df_1'].isin(df_2['id_df_2'].values), :]
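For comparison, the same "rows with no match in the other frame" can also be pulled out of an ordinary outer merge via the indicator flag, which is one way to check that the id approach selects the same rows (a sketch using the sample frames above):

# Outer merge on the four key columns, keeping a merge-indicator column
merged = pd.merge(df_1, df_2, on=['A', 'B', 'C', 'D'],
                  how='outer', indicator=True)

# Rows present only in df_1 and only in df_2, respectively
only_in_df_1 = merged[merged['_merge'] == 'left_only']
only_in_df_2 = merged[merged['_merge'] == 'right_only']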
DataSciRookie
  • this seems like a really bad idea for a dask dataframe of this size, and also I don't see how this accomplishes a merge. can you elaborate on why the creation of this huge ID column is helpful and how this is equivalent to a merge? I don't think it is. – Michael Delgado Sep 09 '22 at 17:05
  • How is this different from an outer merge on a single column, say ts_ns (the unix timestamp column)? Btw, I just tested a merge on a single column; there is no performance increase. – Expired Sep 12 '22 at 07:12