
I have two files, each with 38,374,732 lines and 3.3 GB in size. I am trying to join them on the first column. To do so I decided to use pandas, with the following code pulled from Stack Overflow:

 import pandas as pd
 import sys

 # The second file must stay fully in memory: every chunk of the first
 # file may need to look up any Bin_ID anywhere in it.
 b = pd.read_csv(sys.argv[2], sep='\t', encoding='utf-8-sig')

 chunksize = 10 ** 6
 # Read the first file in chunks instead of loading it all at once
 for i, chunk in enumerate(pd.read_csv(sys.argv[1], sep='\t', encoding='utf-8-sig', chunksize=chunksize)):
     merged = chunk.merge(b, on='Bin_ID')
     # Append so later chunks don't overwrite earlier ones; write the header only once
     merged.to_csv('output.csv', index=False, sep='\t', mode='a', header=(i == 0))

However, I am getting a memory error (not surprising). I looked at the chunked-reading code for pandas (something like How to read a 6 GB csv file with pandas), but how do I implement it for two files in a loop? And I don't think I can chunk the second file, since I need to look up the join column across the whole second file. Is there a way out of this?

AishwaryaKulkarni

1 Answer

This is already discussed in other posts like the one you mentioned (this, or this, or this).

As explained there, I would try using a dask DataFrame to load the data and execute the merge, but depending on your machine you may still not be able to do it.

Minimal working example:

import dask.dataframe as dd

# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')

# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()

# Save the merged dataframe
df.to_csv('merged.csv', index=False)
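
Since the files in the question are tab-separated and joined on Bin_ID, you would presumably also pass the separator through to dd.read_csv (an adaptation of the example above, not part of the original answer):

df1 = dd.read_csv('data1.csv', sep='\t')
df2 = dd.read_csv('data2.csv', sep='\t')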
iipr
  • You can also call `to_parquet()`/`to_csv()` directly instead of `.compute()` – rpanai Dec 12 '18 at 14:32
  • `to_csv()` is already called in the example above. `compute()` is necessary if you want to obtain the results [in memory](http://docs.dask.org/en/latest/user-interfaces.html#laziness-and-computing). If they don't fit in memory then yes, you can save them directly to disk. – iipr Dec 13 '18 at 09:39
  • I guess we are saying the same thing. ;) – rpanai Dec 13 '18 at 14:05
  • How do I install dask dataframe in a Pipfile? – Sharpless512 Aug 12 '20 at 08:23
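
A minimal sketch of the lazy approach the comments above describe, i.e. skipping `.compute()` and letting dask stream the merged result straight to disk (file names and the tab separator are assumptions carried over from the question; dask writes one CSV per partition unless `single_file=True`):

import dask.dataframe as dd

# Lazy, out-of-core reads of the two tab-separated inputs
df1 = dd.read_csv('data1.csv', sep='\t')
df2 = dd.read_csv('data2.csv', sep='\t')

# Merge lazily and write straight to disk without materializing the result
# in memory; the '*' in the name is replaced by the partition number
dd.merge(df1, df2, on='Bin_ID').to_csv('merged-*.csv', index=False, sep='\t')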