1

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'John', 'John','John', 'John','John','John','John','John','John']
                  ,'City': ['A', 'B', 'A','B', 'A','B','B','A','B','A'],'City2': ['C', 'D', 'C','D', 'C','D','D','C','D','C']})
df



     Value  Name  City  City2
    0   0   John    A   C
    1   1   John    B   D
    2   2   John    A   C
    3   3   John    B   D
    4   4   John    A   C
    5   5   John    B   D
    6   6   John    B   D
    7   7   John    A   C
    8   8   John    B   D
    9   9   John    A   C

I am trying to take the average of values when City2 equald 'C' but apply it to the whole new column:

I have tried:

df['C_Average'] = df[df['City2'] == 'C'].groupby(['Name','City'])['Value'].transform(lambda v: v.nsmallest(11).mean())
df
     Value  Name  City City2 C_Average
    0   0   John    A   C   4.4
    1   1   John    B   D   NaN
    2   2   John    A   C   4.4
    3   3   John    B   D   NaN
    4   4   John    A   C   4.4
    5   5   John    B   D   NaN
    6   6   John    B   D   NaN
    7   7   John    A   C   4.4
    8   8   John    B   D   NaN
    9   9   John    A   C   4.4

As you can see the new column is added, but I would like to apply it to the whole column and not just the rows where City2 equals C. IE the whole column to show 4.4. Any ideas?

Thanks!

SOK
  • 1,732
  • 2
  • 15
  • 33

1 Answers1

3

One trick is replace not matched values to missing values instead filtering:

print (df.assign(Value = df['Value'].where(df['City2']== 'C')))
   Value  Name City City2
0    0.0  John    A     C
1    NaN  John    B     D
2    2.0  John    A     C
3    NaN  John    B     D
4    4.0  John    A     C
5    NaN  John    B     D
6    NaN  John    B     D
7    7.0  John    A     C
8    NaN  John    B     D
9    9.0  John    A     C

But problem in sample data is there is no groups with C in groups John, B, so get same ouput:

df['C_Average'] = (df.assign(Value = df['Value'].where(df['City2']== 'C'))
                     .groupby(['Name','City'])['Value']
                     .transform(lambda v: v.nsmallest(11).mean()))

print (df)
   Value  Name City City2  C_Average
0      0  John    A     C        4.4
1      1  John    B     D        NaN
2      2  John    A     C        4.4
3      3  John    B     D        NaN
4      4  John    A     C        4.4
5      5  John    B     D        NaN
6      6  John    B     D        NaN
7      7  John    A     C        4.4
8      8  John    B     D        NaN
9      9  John    A     C        4.4

If change data working nice:

df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'John', 'John','John', 'John','John','John','John','John','John']
                  ,'City': ['A', 'B', 'A','B', 'A','B','B','A','B','A'],'City2': ['C', 'C', 'C','D', 'C','D','D','C','D','C']})

print (df)
   Value  Name City City2
0      0  John    A     C
1      1  John    B     C <- one row for C for group John, B
2      2  John    A     C
3      3  John    B     D
4      4  John    A     C
5      5  John    B     D
6      6  John    B     D
7      7  John    A     C
8      8  John    B     D
9      9  John    A     C

df['C_Average'] = (df.assign(Value = df['Value'].where(df['City2']== 'C'))
                     .groupby(['Name','City'])['Value']
                     .transform(lambda v: v.nsmallest(11).mean()))

print (df)
   Value  Name City City2  C_Average
0      0  John    A     C        4.4
1      1  John    B     C        1.0
2      2  John    A     C        4.4
3      3  John    B     D        1.0
4      4  John    A     C        4.4
5      5  John    B     D        1.0
6      6  John    B     D        1.0
7      7  John    A     C        4.4
8      8  John    B     D        1.0
9      9  John    A     C        4.4

There is difference with your solution:

df['C_Average'] = df[df['City2'] == 'C'].groupby(['Name','City'])['Value'].transform(lambda v: v.nsmallest(11).mean())

print (df)
   Value  Name City City2  C_Average
0      0  John    A     C        4.4
1      1  John    B     C        1.0
2      2  John    A     C        4.4
3      3  John    B     D        NaN
4      4  John    A     C        4.4
5      5  John    B     D        NaN
6      6  John    B     D        NaN
7      7  John    A     C        4.4
8      8  John    B     D        NaN
9      9  John    A     C        4.4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks @jezrael, but i was actually trying to get the whole column to show 4.4 – SOK Apr 21 '21 at 12:51
  • 1
    @SOK - I try expalin why get NaN in sample data - or need replace group `John, B` by values of group `John, A` ? – jezrael Apr 21 '21 at 12:56
  • 1
    Ah yes thanks so much! makes sense now given my example was a bit off haha! – SOK Apr 21 '21 at 13:02
  • 1
    @SOK - Ya, first I was surprise why not working too, but small sample data, so not problem find whats happen here. ;) – jezrael Apr 21 '21 at 13:04
  • hmm im actually getting a `ValueError: Length mismatch: Expected axis has 17510 elements, new values have 52310 elements` when applying it to my full dataframe so i might have an error somewhere – SOK Apr 21 '21 at 13:16
  • @SOK - Is possible missing values for grouping columns, e g here `['Name','City']`? – jezrael Apr 21 '21 at 13:17
  • Ah yes there is some missing `City` Values – SOK Apr 21 '21 at 13:17
  • why are you using `nsmallest(11)` if you would just like the mean? – rhug123 Apr 21 '21 at 13:18
  • I am trying to find the smallest 11 results of the group by and then find the mean – SOK Apr 21 '21 at 13:20
  • 2
    @SOK - I think if possible upgrade it should solve [this](https://stackoverflow.com/a/61922965/2901002) – jezrael Apr 21 '21 at 13:20