I ran into a strange MemoryError, and I don't understand why it's there. Code example:
# some setup
import numpy as np
import pandas as pd
import random
blah = pd.DataFrame(np.random.random((100000,2)), columns=['foo','bar'])
blah['cat'] = blah.apply(lambda x: random.choice(['A','B']), axis=1)
blah['bat'] = blah.apply(lambda x: random.choice([0,1,2,3,4,5]), axis=1)
# the relevant part:
blah['test'] = np.where(blah.cat == 'A',
                        blah[['bat','foo']].groupby('bat').transform(sum),
                        0)
Assigning blah['test'] in this way crashes with a MemoryError, but if I instead do this:
blah['temp'] = blah[['bat','foo']].groupby('bat').transform(sum)
blah['test'] = np.where(blah.cat == 'A',
                        blah['temp'],
                        0)
everything works fine. My guess is that something about how np.where and .groupby() interact causes this.
However, if my initial blah only has the columns 'foo', 'cat', and 'bat' (so no column 'bar', which isn't directly involved in the failing section of code), the first way of doing it also works fine, which just confuses me more.
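One way to probe the np.where / .groupby() guess might be to compare the shapes of the arguments on a much smaller frame. The following is only a minimal sketch under assumptions: the frame small and the names cond, grouped, out, and out_1d are hypothetical stand-ins with made-up values, and 'sum' is passed as a string instead of the builtin sum used above.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for blah (hypothetical values, same column layout).
small = pd.DataFrame({
    'foo': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'bar': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'cat': ['A', 'B', 'A', 'B', 'A', 'B'],
    'bat': [0, 1, 0, 1, 2, 2],
})

# The condition is a 1-D boolean Series.
cond = small.cat == 'A'

# Selecting with a list of columns keeps a DataFrame, so transform
# returns a one-column DataFrame (2-D), not a Series.
grouped = small[['bat', 'foo']].groupby('bat').transform('sum')

print(cond.shape)     # (6,)
print(grouped.shape)  # (6, 1)

# np.where broadcasts its arguments: (6,) against (6, 1) gives (6, 6).
out = np.where(cond, grouped, 0)
print(out.shape)      # (6, 6)

# Picking the single column first keeps everything 1-D.
out_1d = np.where(cond, grouped['foo'], 0)
print(out_1d.shape)   # (6,)
```

If the full-size call broadcasts the same way, the result would need 100000 x 100000 float64 values, roughly 80 GB, which would be consistent with the MemoryError. It would also fit the second version working, since blah['temp'] is a 1-D Series by the time it reaches np.where. This sketch does not by itself explain why dropping 'bar' changes anything.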
What's going on here?