
I ran into a strange MemoryError, and I don't understand why it's there. Code example:

# some setup
import numpy as np
import pandas as pd
import random

blah = pd.DataFrame(np.random.random((100000,2)), columns=['foo','bar'])
blah['cat'] = blah.apply(lambda x: random.choice(['A','B']), axis=1)
blah['bat'] = blah.apply(lambda x: random.choice([0,1,2,3,4,5]), axis=1)

# the relevant part:
blah['test'] = np.where(blah.cat == 'A',
    blah[['bat','foo']].groupby('bat').transform(sum),
    0)

Assigning blah['test'] in this way crashes with a MemoryError, but: if I instead do this:

blah['temp'] = blah[['bat','foo']].groupby('bat').transform(sum)
blah['test'] = np.where(blah.cat == 'A',
    blah['temp'],
    0)

everything works fine. My guess is that there's something about how np.where and .groupby() interact that causes this.

However, if my initial blah only has columns 'foo', 'cat', 'bat' (so no column bar that isn't directly involved in the failing section of code) everything is also fine with the first way of doing it, so that just confuses me more.

What's going on here?

Ketil Tveiten
  • Side note / possibly relevant: use `'sum'` instead of `sum`; you should avoid using Python built-ins with Pandas / NumPy objects. – jpp Dec 13 '18 at 13:18
  • But why, @jpp? Isn't it less overhead when built-ins are used? – ayorgo Dec 13 '18 at 13:20
  • 1
    @ayorgo, Not in the case of NumPy: [see here](https://stackoverflow.com/questions/10922231/pythons-sum-vs-numpys-numpy-sum). – jpp Dec 13 '18 at 13:21
  • @jpp Yes, "avoid using Python builtins" is certainly true here, but I believe that passing `sum` maps to the NumPy ufunc. See `pandas.core.base.SelectionMixin`; `SelectionMixin._builtin_table.get(sum, sum)` – Brad Solomon Dec 13 '18 at 13:32
  • @BradSolomon, Nice, didn't know that! Though I think it's good practice to use strings in the *general case*. That mapping seems to be an implementation detail? – jpp Dec 13 '18 at 13:34
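As a quick check of the thread above, both spellings give the same numbers on a toy Series (whether the builtin is remapped internally, as `SelectionMixin._builtin_table` suggests, is an implementation detail):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
keys = [0, 0, 1, 1]

# Builtin `sum` and the string 'sum' produce the same transform result here.
a = s.groupby(keys).transform(sum)
b = s.groupby(keys).transform('sum')

print(a.equals(b))  # True
```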

2 Answers

The first portion of your code is simply not correct. If you reduce the dataframe size you'll get

ValueError: Wrong number of items passed 1000, placement implies 1

which suggests that np.where does not iterate element-wise over the single-column DataFrame returned by

blah[['bat','foo']].groupby('bat').transform(sum)

but broadcasts it instead: the (n,) condition against the (n, 1) values yields an (n, n) result, and for 100,000 rows that intermediate array is what exhausts memory and raises the MemoryError.
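A tiny NumPy-only sketch shows what np.where does with a one-column 2-D input (shapes here are illustrative, not the question's data):

```python
import numpy as np

cond = np.array([True, False, True])        # shape (3,)
col = np.array([[10.0], [20.0], [30.0]])    # shape (3, 1), like a 1-column DataFrame

out = np.where(cond, col, 0)
# The (3,) condition broadcasts against the (3, 1) values to a (3, 3) result,
# so with 100,000 rows np.where tries to build a 100,000 x 100,000 array.
print(out.shape)  # (3, 3)
```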

Changing your implementation to

blah['test'] = np.where(blah.cat == 'A',
                        blah[['bat','foo']].groupby('bat')['foo'].transform(sum),
                        0)

should help.
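The difference the column selection makes can be seen on a toy frame (data made up for illustration):

```python
import pandas as pd

blah = pd.DataFrame({'bat': [0, 0, 1], 'foo': [1.0, 2.0, 3.0]})

# Without selecting a column, transform returns a one-column DataFrame...
df_out = blah[['bat', 'foo']].groupby('bat').transform('sum')
# ...while selecting 'foo' first returns a flat Series, which np.where
# can pair element-wise with the condition.
s_out = blah[['bat', 'foo']].groupby('bat')['foo'].transform('sum')

print(df_out.shape)  # (3, 1)
print(s_out.shape)   # (3,)
```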

ayorgo
  • @jezrael I figured restricting variables first would use slightly less memory (does it?), which is relevant for my actual use-case. Anyhow, this answers my question. – Ketil Tveiten Dec 13 '18 at 14:05
blah['test'] = np.where(blah['cat'] == 'A',
    blah[['bat','foo']].groupby('bat')['bat'].transform(sum),
    0)

Notice that I added a ['bat'] at the end of groupby('bat').

My rationale is that Python hits the MemoryError because it tries to sum everything in your DataFrame, since you haven't specified exactly which column you want summed.

ycx
  • My understanding was that it applies the functions in the `.transform()` to whatever in the dataframe isn't in the list of `.groupby()` variables, so when I say `blah[['bat', 'foo']]` it sums `foo` for each `bat` group, which is what I want. In any case, how does this explain the difference in behaviour? – Ketil Tveiten Dec 13 '18 at 13:29
  • @KetilTveiten You're not specifying what you wish to `sum`. You're just specifying which columns you wish to keep in `blah[['bat', 'foo']]` and which column you wish to groupby in `.groupby('bat')` – ycx Dec 13 '18 at 13:35
  • Ok, but the groupby/sum works fine when not inside the `np.where`, this is where the confusion arises. – Ketil Tveiten Dec 13 '18 at 13:39
  • @KetilTveiten It is not working fine. You will notice you are putting the entire dataframe into your existing dataframe and the values will be incorrect – ycx Dec 13 '18 at 13:49