
I am on Windows. I have a CSV with ~87 million rows (10-12 columns). I am using a Jupyter notebook and am able to read it in successfully with Pandas. I also have another CSV that's around 100K rows and can read that in as well. The problem occurs when I try to (left outer) join the two. I always end up getting an error along the lines of
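For reference, the join looks roughly like this (column names are placeholders, and the tiny frames here just stand in for the real CSVs):

```python
import pandas as pd

# Stand-ins for the two CSVs ("key" is a placeholder column name).
big = pd.DataFrame({"key": [1, 2, 2, 3], "val": [10, 20, 30, 40]})
small = pd.DataFrame({"key": [1, 2], "label": ["a", "b"]})

# The same left outer join that raises MemoryError at ~87M rows:
merged = big.merge(small, how="left", on="key")
```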

MemoryError: Unable to allocate __ GiB for array with shape (__,__) and data type ___. 

I have tried removing unnecessary columns and converting the Pandas dataframes to recarrays and joining them that way -- that didn't work. I also tried making the data types of the columns in those recarrays as small as possible -- that didn't help either. The __ GiB in the error also changes -- I've seen 1.5, 3, 12... I have 18.4 GB of "total paging file size for all drives" and cannot change this setting. I also have 77.3 GB of local storage free, so I don't think that is the problem.

I have seen another answer where they changed the overcommit memory setting, but it was for Linux. Is there an equivalent solution on Windows? Does it seem like this is something to do with Jupyter or with my machine in general? Any help would be much appreciated.

