
I have this DataFrame to group by key:

import pandas as pd

df = pd.DataFrame({
    'key': ['1', '1', '1', '2', '2', '3', '3', '4', '4', '5'],
    'data1': [['A', 'B', 'C'], 'D', 'P', 'E', ['F', 'G', 'H'], ['I', 'J'], ['K', 'L'], 'M', 'N', 'O'],
    'data2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df

I want to group by key and sum data2; that part is fine. But concerning data1, I want the following (a minimal sketch follows the list):

  1. If no list exists yet:
    • A single value stays unchanged when its key is not duplicated
    • Single values assigned to the same key are combined into a new list
  2. If a list already exists:
    • Other single values are appended to it
    • Other lists' values are appended to it
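A minimal plain-Python sketch of that combining rule (combine_values is a hypothetical helper name, not an existing pandas function; it could be passed as the data1 aggregator in agg):

def combine_values(values):
    # values: the data1 entries sharing one key (single values and/or lists)
    out, saw_list = [], False
    for v in values:
        if isinstance(v, list):
            out.extend(v)   # an existing list: append its elements
            saw_list = True
        else:
            out.append(v)   # a single value: append as-is
    # a key that only ever had one single value keeps it unchanged
    return out[0] if len(out) == 1 and not saw_list else out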

The resulting DataFrame should then be:

dfgood = pd.DataFrame({
    'key': ['1', '2', '3', '4', '5'],
    'data1': [['A', 'B', 'C', 'D', 'P'], ['F', 'G', 'H', 'E'], ['I', 'J', 'K', 'L'], ['M', 'N'], 'O'],
    'data2': [6, 9, 13, 17, 10]
})
dfgood

In fact, I don't really care about the order of the data1 values inside the lists. It could also be any structure that keeps them together, even a string with separators or a set, if that makes it easier to do it the way you think best.
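For instance, a set-based variant might look something like this (a sketch assuming the loss of order and of duplicates is acceptable; dfset is just a placeholder name):

dfset = (df.assign(data1=df['data1'].apply(lambda v: set(v) if isinstance(v, list) else {v}))
           .groupby('key', as_index=False)
           .agg({'data1': lambda s: set().union(*s), 'data2': 'sum'}))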

I thought about two solutions:

  1. Going that way:
dfgood = df.groupby('key', as_index=False).agg({
            'data1' : lambda x: x.iloc[0].append(x.iloc[1]) if type(x.iloc[0])==list else list(x),
            'data2' : sum,
            })
dfgood

It doesn't work because of an index-out-of-range error on x.iloc[1]. I also tried the following, because data1 was organized like this in another groupby from the question on this link:

dfgood = df.groupby('key', as_index=False).agg({
            'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g),
            'data2' : sum,
            })
dfgood

But this creates new lists from the pre-existing lists or values instead of appending data to the already existing lists.
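To see the nesting, inspect one group (this check is just an illustration): for key '1' the group holds one list and two single values, so list(g) keeps the existing list nested instead of flattening it:

g = df.loc[df['key'] == '1', 'data1']
print(list(g))  # [['A', 'B', 'C'], 'D', 'P'] - nested, not flat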

  2. Another way to do it, though I think it's more complicated and there should be a better or faster solution:
    • Turn the data1 lists and single values into individual Series with apply,
    • use wide_to_long to keep single values for each key,
    • then group by key, applying:
dfgood = df.groupby('key', as_index=False).agg({
            'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g),
            'data2' : sum,
            })
dfgood

I think my problem is that I don't know how to use lambdas correctly, and I try stupid things like x.iloc[1] in the previous example. I've looked at a lot of tutorials about lambdas, but it's still fuzzy in my mind.

Lemisourd

2 Answers


You could explode to get individual rows, then aggregate again with groupby+agg after taking care of masking the duplicated values in data2 (to avoid summing duplicates):

(df.explode('data1')
   .assign(data2=lambda d: d['data2'].mask(d.duplicated(['key', 'data2']), 0))
   .groupby('key')
   .agg({'data1': list, 'data2': 'sum'})
)

output:

               data1  data2
key                        
1    [A, B, C, D, P]      6
2       [E, F, G, H]      9
3       [I, J, K, L]     13
4             [M, N]     17
5                [O]     10
mozway
  • I have more than one Series with sums to do; I tried to assign another mask or to modify the mask so it could deal with it, but I didn't manage to find a solution. But `explode()` is new to me, so I did some tests to see what happens: new thing learned! – Lemisourd Jan 07 '22 at 13:46
  • Then a nice approach might be to split the dataframe in two (the lists and the series), handle them separately, then join ;) – mozway Jan 07 '22 at 13:48
  • Ah yes, doing the assignment for the different DFs before combining is a great idea. – Lemisourd Jan 07 '22 at 13:51
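A rough sketch of that split-then-join idea, reusing the question's column names (my sketch, not code from the answer); since the sums are computed on the un-exploded frame, no masking of duplicates is needed:

lists = df.explode('data1').groupby('key')['data1'].agg(list)  # flatten data1 per key
sums = df.groupby('key')['data2'].sum()  # summed before exploding: no duplicated rows
out = pd.concat([lists, sums], axis=1).reset_index()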

The problem is combining lists with scalars. A possible solution is to first create lists from the scalars and then flatten them in groupby.agg:

dfgood = (df.assign(data1 = df['data1'].apply(lambda y: y if isinstance(y, list) else [y]))
            .groupby('key', as_index=False).agg({
            'data1' : lambda x: [z for y in x for z in y],
            'data2' : sum,
            })
            )
print (dfgood)
  key            data1  data2
0   1  [A, B, C, D, P]      6
1   2     [E, F, G, H]      9
2   3     [I, J, K, L]     13
3   4           [M, N]     17
4   5              [O]     10

Another idea is to use a flatten function that flattens only lists, not strings:

#https://stackoverflow.com/a/5286571/2901002
def flatten(foo):
    for x in foo:
        if hasattr(x, '__iter__') and not isinstance(x, str):
            for y in flatten(x):
                yield y
        else:
            yield x

dfgood = (df.groupby('key', as_index=False).agg({
            'data1' : lambda x: list(flatten(x)),
            'data2' : sum}))
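As a quick sanity check of flatten on one group's values (taken from the sample data; the full result is the same as the first variant above):

print(list(flatten([['A', 'B', 'C'], 'D', 'P'])))
# ['A', 'B', 'C', 'D', 'P']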
jezrael
  • note that you can use `sum` instead of `lambda x: [z for y in x for z in y]` ;) – mozway Jan 07 '22 at 10:49
  • @mozway - Don't use sum to concatenate lists. It looks fancy but it's quadratic and should be considered bad practice. – jezrael Jan 07 '22 at 10:50
  • I see your point, that's true if there are many elements – mozway Jan 07 '22 at 10:51
  • @mozway - ya, I'd rather avoid it. – jezrael Jan 07 '22 at 10:51
  • @jezrael - The flattening works well, no problems with duplicates to deal with! All your answers were helpful, but this one was better at solving my issue (because I deal with a lot more Series to aggregate with different functions). – Lemisourd Jan 07 '22 at 13:50
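Since the comments note that sum is quadratic when concatenating lists, a linear-time alternative (my addition, not from the thread) is itertools.chain:

from itertools import chain

list_of_lists = [['A', 'B', 'C'], ['D'], ['P']]
# sum(list_of_lists, []) copies the growing accumulator at every step -> O(n^2)
flat = list(chain.from_iterable(list_of_lists))  # single pass -> O(n)
print(flat)  # ['A', 'B', 'C', 'D', 'P']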