I have two large CSV files that I gather from an API. 99.9% of the time the files have the same number of rows, the same columns, and the same data, except for two or three columns that differ between the files. I'm performing an outer merge on the files based on 4 columns. However, the merge takes a long time: ~8 minutes for two 2.7 GB files, and ~12 minutes for 4 GB files. How can I speed up the merge?
I use Python 3.6.9 and dask 2021.3.0 on a server with 50 GB of RAM and 24 cores. I tried setting an index and merging on the index, but that gave no improvement in the merge time. I cannot use Apache Parquet either: I receive CSV files, and I need to export the data to a single CSV file as well.
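For reference, this is roughly what the current pipeline looks like (a minimal sketch; the file paths and key-column names are placeholders, not the real ones):

```python
import dask.dataframe as dd

# Hypothetical key columns -- the real four come from the API files.
KEY_COLS = ["key1", "key2", "key3", "key4"]

# Read both CSVs lazily; dtypes are left to inference here,
# though pinning them explicitly usually speeds up the read.
left = dd.read_csv("file_a.csv")
right = dd.read_csv("file_b.csv")

# Outer merge on the four key columns; suffixes disambiguate
# the two or three columns that differ between the files.
merged = left.merge(right, on=KEY_COLS, how="outer", suffixes=("_a", "_b"))

# Export everything to one CSV (single_file=True forces a single output file).
merged.to_csv("merged.csv", single_file=True, index=False)
```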