
I have two gzipped CSV files, IMFBOP2017_1.csv.gz and IMFBOP2017_2.csv.gz, with the same columns in both files, i.e. "Location, Indicator, Measure, Unit, Frequency, Date".

Total rows: 60 million+.

I want to compare the two files and display the rows of IMFBOP2017_1 that are not present in IMFBOP2017_2.

My plan is to import both files into dataframes, add an extra "compare" column to both dataframes, and fill it by joining all the fields, like

Location|Indicator|Measure|Unit|Frequency|Date

and then do a NOT IN operation, roughly as in the sketch below.
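
A rough, untested sketch of that plan (pandas assumed; the "compare" key is just the concatenation of all six fields):

```python
import pandas as pd

# Read both gzipped files; pandas infers gzip compression from the .gz suffix.
df1 = pd.read_csv("IMFBOP2017_1.csv.gz")
df2 = pd.read_csv("IMFBOP2017_2.csv.gz")

cols = ["Location", "Indicator", "Measure", "Unit", "Frequency", "Date"]

# Build the "compare" key by joining all fields with "|".
key1 = df1[cols].astype(str).agg("|".join, axis=1)
key2 = df2[cols].astype(str).agg("|".join, axis=1)

# NOT IN: keep the rows of file 1 whose key never appears in file 2.
not_in_2 = df1[~key1.isin(set(key2))]
print(not_in_2)
```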

I think this is a costly process; is there a simpler solution?

– Plinus
  • You should try first, then ask what can be improved, and provide code – Azat Ibrakov Aug 01 '17 at 06:40
  • Yes, it's a costly process. Are the rows in the files sorted in any way? – PM 2Ring Aug 01 '17 at 06:50
  • @PM 2Ring, the rows are not sorted. – Plinus Aug 01 '17 at 06:52
  • 1
    Finding duplicated lines is a lot faster if the data is sorted. Perhaps you should use an external program to sort the data, like the *nix `sort` utility, which can easily sort files that are too big to fit into RAM. Once you have sorted data you can easily process it without reading it all into RAM: you just need to read row one row at a time from each file, and you can find the desired lines using a process similar to merging two lists. I don't know Pandas, so I don't know if Panda provides a function for this, but it's easy enough to write in pure Python. – PM 2Ring Aug 01 '17 at 07:10
  • @PM 2Ring, thanks for the suggestion, will apply this – Plinus Aug 05 '17 at 06:29
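
A rough pure-Python sketch of the merge-style comparison described in the comments above. It assumes the header line was stripped and each file was sorted beforehand, e.g. with `zcat IMFBOP2017_1.csv.gz | tail -n +2 | LC_ALL=C sort > file1.sorted.csv` (`LC_ALL=C` so the shell's sort order matches Python's string comparison); the sorted file names are illustrative:

```python
def diff_sorted(path1, path2):
    """Print lines of path1 that are not present in path2 (both sorted)."""
    with open(path1) as f1, open(path2) as f2:
        line2 = next(f2, None)
        for line1 in f1:
            # Advance file 2 until its current line is >= the file 1 line.
            while line2 is not None and line2 < line1:
                line2 = next(f2, None)
            if line2 != line1:  # no matching line in file 2
                print(line1, end="")

diff_sorted("file1.sorted.csv", "file2.sorted.csv")
```

This only ever holds one line of each file in memory, so it works no matter how large the files are.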

1 Answer


Pandas can read gzipped data files with the ordinary pandas.read_csv(). How to do a diff between two dataframes is described in Pandas: Diff of two Dataframes.
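
For example, a minimal sketch using a left merge with `indicator=True` (one common way to diff dataframes; the merge keys default to the six columns both frames share):

```python
import pandas as pd

# read_csv decompresses .gz files transparently, based on the .gz suffix.
df1 = pd.read_csv("IMFBOP2017_1.csv.gz")
df2 = pd.read_csv("IMFBOP2017_2.csv.gz")

# A left merge on all shared columns; indicator=True adds a "_merge" column
# that is "left_only" for rows of df1 with no match in df2.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)
only_in_1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_1)
```

Note that this still loads both files into memory, so for 60 million+ rows the external-sort approach from the comments may be more practical.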

– C. Nitschke