
I have two gzipped CSV files, IMFBOP2017_1.csv.gz and IMFBOP2017_2.csv.gz, with the same columns in both files, i.e. "Location, Indicator, Measure, Unit, Frequency, Date".

Total rows: 60 million+.

I want to compare the two files and display the rows of IMFBOP2017_1 that are not present in IMFBOP2017_2.

My plan is to import both files into dataframes, add an extra "compare" column to both dataframes, and fill it by joining all the fields, like

Location|Indicator|Measure|Unit|Frequency|Date

and then do a NOT IN operation, roughly as in the sketch below.
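
A rough, untested sketch of that plan (pandas assumed; the "compare" key is just the concatenation of all six fields):

```python
import pandas as pd

# Read both gzipped files; pandas infers gzip compression from the .gz suffix.
df1 = pd.read_csv("IMFBOP2017_1.csv.gz")
df2 = pd.read_csv("IMFBOP2017_2.csv.gz")

cols = ["Location", "Indicator", "Measure", "Unit", "Frequency", "Date"]

# Build the "compare" key by joining all fields with "|".
key1 = df1[cols].astype(str).agg("|".join, axis=1)
key2 = df2[cols].astype(str).agg("|".join, axis=1)

# NOT IN: keep the rows of file 1 whose key never appears in file 2.
not_in_2 = df1[~key1.isin(set(key2))]
print(not_in_2)
```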

I think this is a costly process; is there a simpler solution?

– Plinus
  • You should try first, then ask what can be improved, and provide code – Azat Ibrakov Aug 01 '17 at 06:40
  • Yes, it's a costly process. Are the rows in the files sorted in any way? – PM 2Ring Aug 01 '17 at 06:50
  • @PM 2Ring, the rows are not sorted. – Plinus Aug 01 '17 at 06:52
  • 1
    Finding duplicated lines is a lot faster if the data is sorted. Perhaps you should use an external program to sort the data, like the *nix `sort` utility, which can easily sort files that are too big to fit into RAM. Once you have sorted data you can easily process it without reading it all into RAM: you just need to read row one row at a time from each file, and you can find the desired lines using a process similar to merging two lists. I don't know Pandas, so I don't know if Panda provides a function for this, but it's easy enough to write in pure Python. – PM 2Ring Aug 01 '17 at 07:10
  • @PM 2Ring, thanks for the suggestion, will apply this – Plinus Aug 05 '17 at 06:29
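
A rough pure-Python sketch of the merge-style comparison described in the comments above. It assumes the header line was stripped and each file was sorted beforehand, e.g. with `zcat IMFBOP2017_1.csv.gz | tail -n +2 | LC_ALL=C sort > file1.sorted.csv` (`LC_ALL=C` so the shell's sort order matches Python's string comparison); the sorted file names are illustrative:

```python
def diff_sorted(path1, path2):
    """Print lines of path1 that are not present in path2 (both sorted)."""
    with open(path1) as f1, open(path2) as f2:
        line2 = next(f2, None)
        for line1 in f1:
            # Advance file 2 until its current line is >= the file 1 line.
            while line2 is not None and line2 < line1:
                line2 = next(f2, None)
            if line2 != line1:  # no matching line in file 2
                print(line1, end="")

diff_sorted("file1.sorted.csv", "file2.sorted.csv")
```

This only ever holds one line of each file in memory, so it works no matter how large the files are.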

1 Answer


Pandas can read gzipped data files with the ordinary pandas.read_csv(). How to do a diff between two dataframes is described in Pandas: Diff of two Dataframes.
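
For example, a minimal sketch using a left merge with `indicator=True` (one common way to diff dataframes; the merge keys default to the six columns both frames share):

```python
import pandas as pd

# read_csv decompresses .gz files transparently, based on the .gz suffix.
df1 = pd.read_csv("IMFBOP2017_1.csv.gz")
df2 = pd.read_csv("IMFBOP2017_2.csv.gz")

# A left merge on all shared columns; indicator=True adds a "_merge" column
# that is "left_only" for rows of df1 with no match in df2.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)
only_in_1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_1)
```

Note that this still loads both files into memory, so for 60 million+ rows the external-sort approach from the comments may be more practical.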

– C. Nitschke