
I have these packages installed:

python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1

This is the dataframe info:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421570 entries, 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Data columns (total 5 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
Date_Str        421570 non-null object
dtypes: bool(1), float64(1), int64(2), object(1)

This is a sample of what the data looks like:

Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE

I load the file and index it as follows:

import pandas as pd

df_train = pd.read_csv('train.csv')
df_train['Date_Str'] = df_train['Date']
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])

When I run the following operation on a 400K-row file,

df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)

or

df_train['try'] = df_train['Store'] * df_train['Dept']

it causes an error:

Traceback (most recent call last):
  File "rock.py", line 85, in <module>
    rock.pandasTest()
  File "rock.py", line 31, in pandasTest
    df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype('str')
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 480, in wrapper
    return_indexers=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tseries/index.py", line 976, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1304, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1345, in _join_non_unique
    how=how, sort=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tools/merge.py", line 465, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError

However, it works fine with a small file.

dcc

2 Answers


I can also reproduce this on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13.
So maybe try upgrading your pandas version: on 0.14 the vectorized way is much faster than the apply (5 s vs >1 min on my machine) and uses less peak memory (200 MB vs 980 MB, measured with %memit).

Using your sample data repeated 50000 times (leading to a df of 450k rows), and using the apply_id function of @jsalonen:
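
(One way to build that test frame, just as a sketch and assuming the 9 sample rows are saved to a hypothetical sample.csv, is:)

sample = pd.read_csv('sample.csv')                          # the 9 rows shown in the question
df_train = pd.concat([sample] * 50000, ignore_index=True)   # 9 * 50000 = 450k rows
df_train['Date_Str'] = df_train['Date']
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index('Date')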

In [23]: pd.__version__ 
Out[23]: '0.14.0'

In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop

In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop

In [26]: %load_ext memory_profiler

In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB

In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
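
If upgrading is not an option right away, a possible workaround (only a sketch, not verified on 0.13.1) is to sidestep Series alignment entirely and build the column from the plain column values, so the non-unique DatetimeIndex join shown in the traceback is never triggered:

# build _id from raw values; no Series arithmetic, hence no index alignment/join
df_train['_id'] = ['{}_{}_{}'.format(s, d, t)
                   for s, d, t in zip(df_train['Store'].values,
                                      df_train['Dept'].values,
                                      df_train['Date_Str'].values)]
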
joris
  • BTW, I found an alternative way that has comparable performance with astype(str) in terms of memory usage and speed: df_train['Store'].map(str)+'_' + df_train['Dept'].map(str) + '_' + df_train['Date_Str'].map(str) – dcc Jun 01 '14 at 01:52
  • thanks for mentioning `%memit`, didn't know about it before. – Dennis Golomazov May 08 '17 at 20:48

Try generating the _id field with a DataFrame.apply call:

def apply_id(x):
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

df_train = df_train.apply(apply_id, 1)

When using apply, the ID generation is performed per row, resulting in minimal memory-allocation overhead.

jsalonen
  • Yeah, this way works, but in this thread, http://stackoverflow.com/questions/23950658/python-pandas-operate-on-row, it is said that the vectorized function is faster than using an apply call, and from my experiments that seems true. Vectorized functions tend to use more memory than an apply call, but what confuses me is that I still have lots of memory left when the memory error occurs. – dcc May 30 '14 at 14:25
  • I'm guessing that vectorized functions need to keep the whole vectors in memory while performing the operation, and in your case that's way too much memory. Also, I think you can get a MemoryError even before you actually run out of memory: Python is probably trying to allocate a huge chunk of memory and fails instantly, so it doesn't show any increase in memory consumption. – jsalonen May 30 '14 at 14:29
  • Indeed, but the strange thing is that this should not happen at all for a dataframe of this size (and I also can't reproduce it). – joris May 30 '14 at 14:38
  • Well note that you are not only appending three values but also converting them to strings. Vectorized string values -> BOOM – jsalonen May 30 '14 at 14:40
  • I'm just confused why the vectorized functions would consume so much memory; isn't vectorization the preferred way to do column manipulation in pandas? Also, pandas's DataFrame is similar to R's data frame, and R handles the same vector calculation pretty efficiently on the same computer. – dcc May 30 '14 at 14:40
  • Okay I stand corrected: I fiddled a little bit with the vectorized operations and indeed I can get the memory error even without strings. Back to drawing board... – jsalonen May 30 '14 at 14:44
  • @jsalonen Can you reproduce the memory error with a dataframe of this size? For one int64 column, this would give some 3 Mb of data if I am not mistaken, so even with converting it to strings, or having 3 of these columns, this should not give a problem on most computers these days. – joris May 30 '14 at 19:49
  • I can reproduce the memory problem. Still investigating further but no luck on faster approaches. Also timed the apply solution and it takes about 5 minutes with 500k rows – jsalonen May 30 '14 at 20:19
  • I tried the same thing on a machine with 8 GB of RAM and the error remained. – dcc May 30 '14 at 23:55
  • Sorry, I can also reproduce it on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems a bug in 0.13. So, maybe try to upgrade your pandas version, as the vectorized way is much faster than the apply (5s vs >1min on my machine), *and* uses less peak memory (200Mb vs 980Mb, with `%memit`) on 0.14. – joris May 31 '14 at 11:14