
I'm trying to replace values in a Pandas data frame, based on certain criteria on multiple columns. For a single column criteria this can be done very elegantly with a dictionary (e.g. Remap values in pandas column with a dict):

import pandas as pd

df = pd.DataFrame({'col1': {0:1, 1:1, 2:2}, 'col2': {0:10, 1:20, 2:20}})

rdict = {1:'a', 2:'b'}
df2 = df.replace({"col1": rdict})

Input df:

   col1  col2
0     1    10
1     1    20
2     2    20

Resulting df2:

  col1  col2
0    a    10
1    a    20
2    b    20

I'm trying to extend this to criteria over multiple columns (e.g. where col1==1, col2==10 -> replace). For a single criteria this can be done like:

df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'c'

Which results in a df3:

  col1  col2
0    c    10
1    1    20
2    2    20 

My real life problem has a large number of criteria, which would involve a large number of `df3.loc[((criteria1)&(criteria2)), column] = value` calls, which is far less elegant than the replacement using a dictionary as a "lookup table". Is it possible to extend the elegant solution (`df2 = df.replace({"col1": rdict})`) to a setup where values in one column are replaced based on criteria over multiple columns?


An example of what I'm trying to achieve (although in my real life case the number of criteria is a lot larger):

df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})

df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'a'
df3.loc[((df['col1']==1)&(df['col2']==20)), 'col1'] = 'b'
df3.loc[((df['col1']==2)&(df['col2']==10)), 'col1'] = 'c'
df3.loc[((df['col1']==2)&(df['col2']==20)), 'col1'] = 'd'

Input df:

   col1  col2
0     1    10
1     1    20
2     2    10
3     2    20

Resulting df3:

  col1  col2
0    a    10
1    b    20
2    c    10
3    d    20
Bart

3 Answers


Demo:

Source DF:

In [120]: df
Out[120]:
   col1  col2
0     1    10
1     1    10
2     1    20
3     1    20
4     2    10
5     2    20
6     3    30

Conditions & Replacements DF:

In [121]: cond
Out[121]:
   col1  col2 repl
1     1    20    b
2     2    10    c
0     1    10    a
3     2    20    d

Solution:

In [121]: res = df.merge(cond, how='left')

yields:

In [122]: res
Out[122]:
   col1  col2 repl
0     1    10    a
1     1    10    a
2     1    20    b
3     1    20    b
4     2    10    c
5     2    20    d
6     3    30  NaN   # <-- NOTE


In [123]: res['col1'] = res.pop('repl').fillna(res['col1'])

In [124]: res
Out[124]:
  col1  col2
0    a    10
1    a    10
2    b    20
3    b    20
4    c    10
5    d    20
6    3    30
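The `cond` frame itself doesn't have to be typed out by hand; it can be built from a plain dict keyed by `(col1, col2)` tuples. A minimal sketch (the dict and data here are illustrative, mirroring the demo above):

```python
import pandas as pd

# Hypothetical mapping: (col1, col2) -> replacement value
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}

# Build the conditions/replacements frame from the dict
cond = pd.DataFrame(
    [(c1, c2, repl) for (c1, c2), repl in rdict.items()],
    columns=['col1', 'col2', 'repl'],
)

df = pd.DataFrame({'col1': [1, 1, 1, 1, 2, 2, 3],
                   'col2': [10, 10, 20, 20, 10, 20, 30]})

# Left merge on the shared columns, then fall back to the original
# col1 value for rows with no matching condition (e.g. (3, 30))
res = df.merge(cond, how='left')
res['col1'] = res.pop('repl').fillna(res['col1'])
```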
MaxU - stand with Ukraine
  • As mentioned, my real life problem has a lot of different conditions like `cond = "col1==1 and col2==10"`, unfortunately I don't see how this solves my problem. – Bart Jan 27 '18 at 13:32
  • @Bart, OK, you have multiple conditions, but do you have a single replacement dict or multiple? – MaxU - stand with Ukraine Jan 27 '18 at 13:33
  • I'm not sure if I understand "but do you have a single replacement dict or multiple"; I have multiple criteria, which all map to a different replacement value. – Bart Jan 27 '18 at 13:40
  • Interesting solution, this might work well if I base `cond` on (for example) an input CSV file where the conditions and replacements are defined. I'll give it a try, unless someone posts a more elegant solution, I'll accept this answer. – Bart Jan 27 '18 at 13:51

We can use merge.

Suppose your df looks like

df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2, 4:2, 5:1}, 'col2': {0:10, 1:20, 2:10, 3:20, 4: 20, 5:10}})

    col1 col2
0   1    10
1   1    20
2   2    10
3   2    20
4   2    20
5   1    10

And your conditional replacement can be represented as another dataframe:

df_replace

  col1  col2    val
0   1   10      a
1   1   20      b
2   2   10      c
3   2   20      d

(As OP (Bart) pointed out, you can save this in a csv file.)

Then you can use

df = df.merge(df_replace, on=["col1", "col2"], how="left")

    col1    col2    val
0   1       10      a
1   1       20      b
2   2       10      c
3   2       20      d
4   2       20      d
5   1       10      a

Then you just need to drop col1.

As MaxU pointed out, there could be rows that do not get replaced, resulting in NaN. We can use a line like

df["val"] = df["val"].combine_first(df["col1"])

to fill in values from col1 wherever the merged val is NaN.
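Putting the steps above together, a self-contained sketch (the data mirrors the example, and the final `combine_first`/`drop`/`rename` is one possible way to tidy the columns back up):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 2, 1],
                   'col2': [10, 20, 10, 20, 20, 10]})

# Replacement table; in practice this could be read with pd.read_csv
df_replace = pd.DataFrame({'col1': [1, 1, 2, 2],
                           'col2': [10, 20, 10, 20],
                           'val':  ['a', 'b', 'c', 'd']})

out = df.merge(df_replace, on=['col1', 'col2'], how='left')

# Keep the original col1 where no replacement matched, then drop the
# helper column and restore the original column names/order
out['val'] = out['val'].combine_first(out['col1'])
out = out.drop(columns='col1').rename(columns={'val': 'col1'})[['col1', 'col2']]
```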

Tai
  • Great, this will work. I can create `df_replace` from (for example) a CSV file where the conditions/replacements are defined, to keep the code clean. – Bart Jan 27 '18 at 14:05
  • @Bart that is a good idea to keep it in another csv file! Let me add it. – Tai Jan 27 '18 at 14:06
  • 1
    You will still need to take care of those rows that will not be replaced... – MaxU - stand with Ukraine Jan 27 '18 at 14:06
  • @MaxU it is a left merge, so it should be fine. The values in `df` that are not matched will be preserved. – Tai Jan 27 '18 at 14:07
  • @MaxU, is that why you deleted your answer? It seemed like a perfectly fine solution as well. In my case all combinations of `col1` and `col2` are guaranteed (*famous last words..*) to map to a new value. – Bart Jan 27 '18 at 14:09

This method is likely to be more efficient than pandas functionality, as it relies on numpy arrays and dictionary mappings.

import pandas as pd

df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})

rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}

df['col1'] = list(map(rdict.get, [(x[0], x[1]) for x in df[['col1', 'col2']].values]))
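Note that `rdict.get` returns None for any `(col1, col2)` pair not in the dict. A variant that falls back to the original `col1` value for unmatched rows, analogous to the left-merge behaviour (a sketch with illustrative data, including an unmatched row):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3], 'col2': [10, 20, 20, 30]})
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}

# Use dict.get's default argument to keep the existing col1 value
# when the (col1, col2) pair has no entry in rdict, e.g. (3, 30)
df['col1'] = [rdict.get((c1, c2), c1) for c1, c2 in zip(df['col1'], df['col2'])]
```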
jpp
  • This doesn't produce the same results as the `df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'c'`, which only replaces values in a single column based on a criterion over multiple columns – Bart Jan 27 '18 at 13:30
  • @Bart, what about this solution? – jpp Jan 27 '18 at 14:41
  • That's great, I like it as it uses the same (sort of) method with dictionaries for the replacement (like the `df.replace({"col1": rdict})` method for a single column criteria). I have to see if the difference in performance is relevant in my case, if so this might be my preferred solution. – Bart Jan 27 '18 at 15:02
  • I coded a quick benchmark, this actually seems to be slower (factor ~3.5)... (https://pastebin.com/zSEwJaYR, see benchmark results on the bottom). – Bart Jan 27 '18 at 19:09
  • 1
    @Bart, thanks for taking the effort to benchmark. I've changed my logic very slightly to use a list comprehension (small speed up). Where you lose with the `df` merge method is creation of `convert_df`. Including set-up costs of `convert_df` and `convert_dict`, I see very close timings for n_rows = 10000, cats = 5. If you try n_rows = 10000, cats = 50, I find that the setup cost for `convert_df` is ~1.5s (~600x worse than `convert_dict`). So you are right to go with `df.merge`, but only if the mapping is static / built once / not repeatedly read from csv. – jpp Jan 27 '18 at 19:55