Proper DataFrame slice modification

Question

I'm trying to modify a set of columns from a set of rows and of course I get the warning about:

A value is trying to be set on a copy of a slice from a DataFrame

I saw a similar question here but can't wrap my head around it.

So if we follow this example code:

from random import random as rd
ex= pd.DataFrame([{"group": ["a","b"][int(round(rd()))], "colA": rd()*10, "colB": rd()*10, "colC": rd()*10,  "colD": rd()*10} for _ in range(20)])
cols = [col for col in ex.columns if col != "group"]

I would like to modify only the rows that belong to group a and only on columns cols, for which I kind of intuitively go to try (and get the caveat warning):

ex[ex["group"]=="a"][cols] = ex[ex["group"]=="a"][cols]/ex.ix[0,cols]

The number of columns match and have the same labels, so I wonder if would have to go one by one like:

for idx in ex[ex["group"]=="a"].index:
    for col in cols:
        ex.ix[idx, col]=ex.ix[idx, col]/ex.ix[0,col]

Which of course works, but kind of feels like a step back. So what would be the correct way of doing something like this?

EdChum · Accepted Answer · 2015-10-08T09:28:55.570

IIUC you can do this in one step using .loc, your boolean condition and pass the list of cols:

In [110]:
from random import random as rd
ex= pd.DataFrame([{"group": ["a","b"][int(round(rd()))], "colA": rd()*10, "colB": rd()*10, "colC": rd()*10,  "colD": rd()*10} for _ in range(20)])
cols = [col for col in ex.columns if col != "group"]
ex

Out[110]:
        colA      colB      colC      colD group
0   5.895114  3.961007  0.589091  9.846131     a
1   1.789049  7.532745  2.767378  9.144689     b
2   1.218778  2.715299  3.626688  6.516540     a
3   9.327049  3.207037  4.513850  1.910565     b
4   1.822876  0.049689  0.794706  8.463579     a
5   1.451741  6.045066  6.575130  4.882635     b
6   6.741825  4.253489  2.162466  1.050275     a
7   5.186613  3.401384  1.055468  4.060071     a
8   0.921352  8.076272  6.727293  3.219364     a
9   3.209232  8.883085  9.696195  4.089006     b
10  0.970030  6.412611  5.377420  5.475744     b
11  7.905807  4.576925  6.991989  2.974597     b
12  4.907642  7.123328  9.851058  2.337944     b
13  1.191606  2.636071  5.740342  3.301008     b
14  1.454777  3.086801  3.573110  1.402692     b
15  3.253882  1.853393  5.156287  8.268881     b
16  4.779060  4.689739  1.228976  6.339238     b
17  7.950160  4.973974  4.304821  4.492152     b
18  0.581628  6.860053  2.974577  6.542594     a
19  6.872025  9.216597  0.936447  5.518941     b

In [111]:    
ex.loc[ex['group']=='a', cols] /= ex.iloc[0][cols]
ex

Out[111]:
        colA      colB       colC      colD group
0   1.000000  1.000000   1.000000  1.000000     a
1   1.789049  7.532745   2.767378  9.144689     b
2   0.206744  0.685507   6.156417  0.661838     a
3   9.327049  3.207037   4.513850  1.910565     b
4   0.309218  0.012545   1.349039  0.859584     a
5   1.451741  6.045066   6.575130  4.882635     b
6   1.143629  1.073840   3.670853  0.106669     a
7   0.879816  0.858717   1.791690  0.412352     a
8   0.156291  2.038944  11.419789  0.326967     a
9   3.209232  8.883085   9.696195  4.089006     b
10  0.970030  6.412611   5.377420  5.475744     b
11  7.905807  4.576925   6.991989  2.974597     b
12  4.907642  7.123328   9.851058  2.337944     b
13  1.191606  2.636071   5.740342  3.301008     b
14  1.454777  3.086801   3.573110  1.402692     b
15  3.253882  1.853393   5.156287  8.268881     b
16  4.779060  4.689739   1.228976  6.339238     b
17  7.950160  4.973974   4.304821  4.492152     b
18  0.098663  1.731896   5.049437  0.664484     a
19  6.872025  9.216597   0.936447  5.518941     b

Timings

In [112]:
%%timeit
for idx in ex[ex["group"]=="a"].index:
    for col in cols:
        ex.ix[idx, col]=ex.ix[idx, col]/ex.ix[0,col]
100 loops, best of 3: 11 ms per loop

In [113]:
%timeit ex.loc[ex['group']=='a', cols] /= ex.iloc[0][cols]
100 loops, best of 3: 5.3 ms per loop

So on your small sample size my method is over 2X faster, I expect it to scale much better with larger datasets as it's vectorised

As an update: when working with column names which are floats, e.g. ```[1.1, 2.1, 3.4, 4.5]``` instead of ```["colA", "colB", "colC", "colD"]``` the slicing will work when you want to look at it, but if want to do the attribution to that slice, it will not work! — Animismus, Oct 12 '15 at 12:24
It's generally not a good idea to use floats for column or index values — EdChum, Oct 12 '15 at 12:25

Proper DataFrame slice modification

1 Answers1