
I have a CSV file with ~50,000 rows and 300 columns. The following operation causes a memory error in pandas (Python):

merged_df.stack(0).reset_index(1)

The data frame looks like:

GRID_WISE_MW1   Col0    Col1    Col2 .... Col300
7228260         1444    1819    2042
7228261         1444    1819    2042
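
For reference, on a tiny frame the same reshape looks like this (made-up values; stack pivots the columns into the index, and reset_index(1) turns that level back into a regular column):

import numpy as np
import pandas as pd

small = pd.DataFrame(np.random.randn(2, 3), columns=['Col0', 'Col1', 'Col2'])
print(small.stack(0).reset_index(1))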

I am using the latest pandas (0.13.1), and the error does not occur with dataframes that have fewer rows (~2,000).

thanks!

user308827
  • That wouldn't help here because I am using merged_df.stack(0).reset_index(1) in a pandas.merge operation.... – user308827 Apr 21 '14 at 20:54

2 Answers


On my 64-bit Linux machine (32GB of RAM), this takes a little less than 2GB of memory (%memit comes from the memory_profiler IPython extension):

In [4]: import numpy as np
   ...: from pandas import DataFrame

In [5]: def f():
   ...:     df = DataFrame(np.random.randn(50000, 300))
   ...:     df.stack().reset_index(1)

In [6]: %memit f()
maximum of 1: 1791.054688 MB per loop

Since you didn't specify your platform: this won't work on 32-bit at all (you usually can't allocate a 2GB contiguous block there), but it should work on 64-bit if you have reasonable swap/memory. For scale, the raw values alone are only 50,000 × 300 × 8 bytes ≈ 120 MB; the ~1.8 GB peak comes largely from the 15-million-row stacked result, its MultiIndex, and the intermediate copies made during the reshape.
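
If you're not sure which interpreter you're running, a quick check (a minimal sketch; `struct.calcsize("P")` reports the pointer size in bytes):

import platform
import struct

print(struct.calcsize("P") * 8)   # 32 on a 32-bit interpreter, 64 on 64-bit
print(platform.architecture())    # e.g. ('64bit', ...)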

Jeff
  • Ahh, I am using Windows 7 64 bit, 8 GB RAM, but my pandas is 32 bit, could that be the issue? – user308827 Apr 21 '14 at 23:32
  • yep; you can install 64-bit Python (and all packages), or use `conda` to do so. 32-bit has a 4GB addressable limit, but Python needs a contiguous block for an allocation like this, so a frame that size is too big to stack reliably. In my experience 32-bit has issues with anything > 1GB; 64-bit scales with no problem, however. – Jeff Apr 21 '14 at 23:39
  • @Jeff Thanks for the remark! I've been fighting for a good week with `pandas` to load only ~400MB of data into one `DataFrame`, when a list of smaller `DataFrame` instances holding the same total amount loads without a problem, and your explanation is surely the answer: I'm on 32-bit Python, as our OSs at work are stuck on 32-bit Windows. :-/ – Joël Dec 08 '15 at 15:09
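
A sketch of the list-of-smaller-frames approach Joël describes above: reading in chunks means no single allocation needs a huge contiguous block (the file name and chunk size here are hypothetical):

import pandas as pd

# read_csv with chunksize yields an iterator of smaller DataFrames
frames = [chunk for chunk in pd.read_csv('data.csv', chunksize=5000)]
# concatenate only if/when you actually need one big frame:
# df = pd.concat(frames, ignore_index=True)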

As an alternative approach, you can use the "dask" library, e.g.:

# Dask dataframes implement much of the pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
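
Dask evaluates lazily, so nothing is loaded until you call `.compute()`. A hedged continuation (using `melt()` as a wide-to-long stand-in, since dask does not implement `stack()`; the id column name is taken from the question's frame):

# operations only build a task graph until .compute() runs
long_df = df.melt(id_vars='GRID_WISE_MW1')  # wide -> long, similar in spirit to stack()
result = long_df.compute()                  # materializes an ordinary pandas DataFrame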