2

I have a simple dataframe like this:

df = pd.DataFrame({'class':['a','b','c','d','e'],
                  'name':['Adi','leon','adi','leo','andy'],
                  'age':['9','8','9','9','8'],
                   'score':['40','90','35','95','85']})

then the result is like this

 class  name   age  score
    a   Adi     9   40
    b   leon    8   90
    a   adi     9   35
    d   leo     9   95
    e   andy    8   85

how can I combine the row named 'Adi' with 'adi' in the same column while he is only one person and the score 'Adi' is 75, not 40 and 35

buhtz
  • 10,774
  • 18
  • 76
  • 149

2 Answers2

1

You could use pandas.DataFrame.groupby and pandas.DataFrame.aggregate after first making the name column lowercase:

import pandas as pd

df = pd.DataFrame({
    'class': ['a', 'b', 'c', 'd', 'e'],
    'name': ['Adi', 'leon', 'adi', 'leo', 'andy'],
    'age': ['9', '8', '9', '9', '8'],
    'score': ['40', '90', '35', '95', '85']
})
df['name'] = df['name'].str.lower()
df['score'] = df['score'].astype(int)
aggregate_funcs = {
    'class': lambda s: ', '.join(set(s)),
    'age': lambda s: ', '.join(set(s)),
    'score': sum
}
df = df.groupby(df['name']).aggregate(aggregate_funcs)
print(df)

Output:

     class age  score
name                 
adi   c, a   9     75
andy     e   8     85
leo      d   9     95
leon     b   8     90
Sash Sinha
  • 18,743
  • 3
  • 23
  • 40
  • Regarding [your earlier comment](https://stackoverflow.com/questions/70788682/#comment125148633_70788682) (question got deleted, can't reply there): That's not O(1) space but O(n) space. For example for `nums = list(range(1, 10**6))`, `tracemalloc` reports you use 28 MB. As expected for a million `int` objects each taking 28 bytes. – Kelly Bundy Jan 26 '22 at 20:30
  • Yeah sorry, I meant `O(1)` in terms of extra additional space, since `nums` has to be passed as an argument. – Sash Sinha Jan 27 '22 at 09:08
  • Not sure how you mean that. It isn't O(1) in terms of "extra additional" space. Unless you're defining that in a way I don't see? :-) You are creating 28 MB of additional data there. I have two ideas for similar solutions that really are O(1) extra space, one of them able to restore the data in the end (which I think is the least one should do in the real world... if you tried yours at work, I kinda doubt you'd get an LGTM :-) – Kelly Bundy Jan 27 '22 at 12:18
  • Everything that you said makes sense, it would be O(N) in terms of space. I have however seen a few questions relating to API design for image processing in the interview question back, in terms of the manipulation of a 2D matrix of 1, 0's representing black and white pixels, like dilating, etc. In these questions, the author mentions if the candidate talks about the possibility of creating two separate functions i.e., one that is destructive on the input image (and the reasons for wanting one), and another one that makes and returns a modified copy, it is a positive signal. – Sash Sinha Jan 27 '22 at 12:45
  • I agree with that, although I'd say that's rather different, as the output in those cases is the modified matrix, right? I.e., modifying it can be the *desired* effect. In our case, on the other hand, the output is a single int, and your code modifies the input list not as the *desired* effect but as a *side* effect. *That's* what I think makes it not so good. – Kelly Bundy Jan 27 '22 at 13:42
  • Meh... the first of my two ideas doesn't work. I thought I could maybe mark with just `True` instead. I.e., remember the occurrence of a number `num` by setting `nums[num - 1] = True` in a forwards-loop not overwriting later values and then a backwards-loop. But doesn't work. The second idea is to use `ctypes` to mess with the bytes of the `list` object similar to how I [just did for something else](https://stackoverflow.com/q/70882092/12671057). I think it would work, but it'd be messy and I'm not motivated enough. Btw let me know if you want us to delete our comments here. – Kelly Bundy Jan 27 '22 at 17:32
1

drop_duplicates() is the best way if you are using pandas

df['name'] = df['name'].str.lower()
df['score'] = df['score'].astype(int)
df['score'] = df['score'].groupby(df['name']).transform(sum)
df.drop_duplicates(subset='name',keep='first',inplace=True)

output:

  class  name age  score
0     a   adi   9     75
1     b  leon   8     90
3     d   leo   9     95
4     e  andy   8     85

you will have this output if you set keep='last':

  class  name age  score
1     b  leon   8     90
2     c   adi   9     75
3     d   leo   9     95
4     e  andy   8     85
Omar EL KHAL
  • 151
  • 4
  • 13