
Here's the data:

import pandas as pd

df = pd.DataFrame({
    'date': [1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    'request': [2, 2, 2, 3, 3, 2, 3, 3, 3, 3],
    'users': [1, 3, 7, 1, 7, 3, 4, 9, 7, 9],
    'count': [1, 1, 2, 3, 1, 3, 1, 2, 1, 1]
})

df

   count  date  request  users
0      1     1        2      1
1      1     1        2      3
2      2     2        2      7
3      3     2        3      1
4      1     2        3      7
5      3     3        2      3
6      1     3        3      4
7      2     3        3      9
8      1     4        3      7
9      1     5        3      9

The idea is to group by date and request, and convert every other column into a list of grouped values. I thought this would be as simple as calling .agg on the groupby object, but it is not.

This is what I want to do:

   date  request   count   users
0     1        2  [1, 1]  [1, 3]
1     2        2     [2]     [7]
2     2        3  [3, 1]  [1, 7]
3     3        2     [3]     [3]
4     3        3  [1, 2]  [4, 9]
5     4        3     [1]     [7]
6     5        3     [1]     [9]

This is how I have done it:

grouped_df = df.groupby(['date', 'request'])

df_new = pd.DataFrame({
    'count': grouped_df['count'].apply(list),
    'users': grouped_df['users'].apply(list)
}).reset_index()

It works, but I believe there has to be a better way... one that works on all the columns in the grouped object. For example, grouping by just date should also work with the same solution. Mine relies on hardcoding the column names, which I dislike doing, so it would fail in that case.
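
For reference, the shape of the generic call I'm after is something like the sketch below. As far as I can tell it works on recent pandas versions when you pass the builtin list (an assumption on my part; older versions are pickier about aggregation functions that do not reduce to a scalar):

df.groupby(['date', 'request']).agg(list).reset_index()

This picks up every non-key column automatically, with no hardcoding.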

This is something that has been bothering me. There should be an obvious solution, but I cannot find it. Is there a better way?


Calling all my Pandas MVPs...

piRSquared
cs95
  • Is `df.groupby(['date', 'request']).agg(lambda x: [x.tolist()])` okay, list of lists though? – Zero Jul 26 '17 at 20:02
  • @JohnGalt Haha that seems too easy... no... I am looking for a list, not list of lists :) – cs95 Jul 26 '17 at 20:03
  • At the moment, this hack would work: `df.groupby(['date', 'request']).agg(lambda x: [x.tolist()]).apply(lambda x: x[0], axis=1)`. But it seems clunky. – Zero Jul 26 '17 at 20:05
  • Standard Warning: Storing list-likes in cells is not an intended use case for pandas and is poorly supported. By doing so you lose all of the speedups associated with pandas, and will essentially be operating at pure python speed when dealing with those columns. There is usually a better data representation that can be used. – root Jul 26 '17 at 20:50
  • @root Yes. The use cases for such an operation are few and far between, but they do exist. Will keep this in mind. – cs95 Jul 26 '17 at 20:52

3 Answers

df.groupby(['request','date'])[['count','users']].agg(lambda x: ','.join(x.astype(str)))

Output:

             count users
request date            
2       1      1,1   1,3
        2        2     7
        3        3     3
3       2      3,1   1,7
        3      1,2   4,9
        4        1     7
        5        1     9
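
If actual lists are needed rather than joined strings, one possible tweak would be to split the strings back apart (the elements come back as strings, so an int cast would still be needed):

out = df.groupby(['request', 'date'])[['count', 'users']].agg(
    lambda x: ','.join(x.astype(str)))
out.applymap(lambda v: v.split(','))  # '1,1' -> ['1', '1']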
Scott Boston
  • I guess you meant `lambda x: x.tolist()`? I would like the columns as a list. – cs95 Jul 26 '17 at 20:01
  • I guess you could do `df.groupby(['date', 'request']).agg(lambda x: [x.tolist()]).applymap(lambda x: x[0]) ` but this is still pretty ugly. – cs95 Jul 26 '17 at 20:04
  • what about `df.groupby(['date', 'request']).agg(lambda x: [x.tolist()]).apply(lambda x: x.str[0])`? – MaxU - stand with Ukraine Jul 26 '17 at 20:04
  • The fact that `df.agg` works means we are on the right track. But I'd ideally like that second apply call to disappear... somehow... hmm... – cs95 Jul 26 '17 at 20:06

Even Better Answer
Find where duplicates happen, then split and filter accordingly. (This relies on rows with the same keys being contiguous, as they are in this sorted sample.)

import numpy as np

# mark every row of a (request, date) group except the last one
dups = df.duplicated(['request', 'date'], keep='last').values
# split points: one past each group's last row
i = np.where(~dups[:-1])[0] + 1
# columns come out alphabetically: count, date, request, users
c, d, r, u = (df[col].values for col in df)

d1 = pd.DataFrame(
    np.column_stack([d[~dups], r[~dups]]), columns=['date', 'request'])
d2 = pd.DataFrame(
    np.column_stack([np.split(c, i), np.split(u, i)]), columns=['count', 'users'])

d1.join(d2)

   date  request   count   users
0     1        2  [1, 1]  [1, 3]
1     2        2     [2]     [7]
2     2        3  [3, 1]  [1, 7]
3     3        2     [3]     [3]
4     3        3  [1, 2]  [4, 9]
5     4        3     [1]     [7]
6     5        3     [1]     [9]

Answer I feel good about!
Yay! defaultdict

from collections import defaultdict

d = defaultdict(list)

# long form: one (date, request, column) -> value entry per cell
s = df.set_index(['date', 'request']).stack()
# append each value to its (date, request, column) bucket
[d[k].append(v) for k, v in s.iteritems()];

pd.Series(d).unstack().rename_axis(['date', 'request']).reset_index()

   date  request   count   users
0     1        2  [1, 1]  [1, 3]
1     2        2     [2]     [7]
2     2        3  [3, 1]  [1, 7]
3     3        2     [3]     [3]
4     3        3  [1, 2]  [4, 9]
5     4        3     [1]     [7]
6     5        3     [1]     [9]

Old Answer

# one Series per group: each column's values transposed into a list
f = lambda x: pd.Series(x.values.T.tolist(), x.columns)
df.groupby(['request', 'date'])[['count', 'users']].apply(f).reset_index()

   request  date   count   users
0        2     1  [1, 1]  [1, 3]
1        2     2     [2]     [7]
2        2     3     [3]     [3]
3        3     2  [3, 1]  [1, 7]
4        3     3  [1, 2]  [4, 9]
5        3     4     [1]     [7]
6        3     5     [1]     [9]

Frustration Answer!
Shoehorning agg

from ast import literal_eval

# round-trip through strings so agg sees a scalar, then parse the lists back
df.groupby(['request', 'date']).agg(
    lambda x: str(list(x))
).applymap(literal_eval).reset_index()

   request  date   count   users
0        2     1  [1, 1]  [1, 3]
1        2     2     [2]     [7]
2        2     3     [3]     [3]
3        3     2  [3, 1]  [1, 7]
4        3     3  [1, 2]  [4, 9]
5        3     4     [1]     [7]
6        3     5     [1]     [9]
piRSquared
  • Another solution I attempted, but didn't mention here was `df.groupby(['date', 'request']).apply(lambda x: pd.Series((x['count'].tolist(), x['users'].tolist()))).reset_index()`. This is definitely along the same lines but much cleaner. I like this. – cs95 Jul 26 '17 at 20:13
  • Can this be tweaked to work with agg? So as to remove the need to index the columns? – cs95 Jul 26 '17 at 20:14
  • 2
    Problem with `agg` is that it checks to see if the function actually reduces. If it's an array and not a scalar, `agg` will yell at me (and I'm a sensitive snowflake). – piRSquared Jul 26 '17 at 20:23
  • Indeed. "function does not reduce" has been plaguing me all night. – cs95 Jul 26 '17 at 20:26
  • @piRSquared - why use eval when you could `df.groupby(['date', 'request']).agg(lambda x: [x.tolist()]).apply(lambda x: x[0], axis=1)`? – Zero Jul 26 '17 at 20:35
  • @JohnGalt Explored that possibility (see Scott's answer) It's really clunky. – cs95 Jul 26 '17 at 20:36
  • @cᴏʟᴅsᴘᴇᴇᴅ - check my clunky answer with apply as well :-) – Zero Jul 26 '17 at 20:37
  • In the end I don't believe there is a way around that clunkiness. I am severely disappointed by pandas because this should be something that is more easily attainable than it currently is. With that said I believe the series answer is the least hacky one. – cs95 Jul 26 '17 at 20:39
  • Nice! That's a totally different route. Although I'll venture a guess and say your first answer would be faster. – cs95 Jul 26 '17 at 20:49
  • All the same, I appreciate your efforts as well as the others. Thanks for answering :) – cs95 Jul 26 '17 at 20:50
  • @cᴏʟᴅsᴘᴇᴇᴅ over the sample dataset, `defaultdict` answer is quicker. `3.87 ms` vs. `5.74 ms` – piRSquared Jul 26 '17 at 21:20
  • Oh? That's pleasantly surprising. Although... For larger datasets, I believe the bottleneck will be iteritems in the list comp.... possibly. I should benchmark this. – cs95 Jul 26 '17 at 21:23
  • OK, added a new faster answer. – piRSquared Jul 26 '17 at 22:18

Hacky way using apply, which could possibly be very slow.

In [1274]: df.groupby(['date', 'request']).apply(
          lambda x: pd.Series({c: x[c].tolist() for c in ['count', 'users']}))
Out[1274]:
               count   users
date request
1    2        [1, 1]  [1, 3]
2    2           [2]     [7]
     3        [3, 1]  [1, 7]
3    2           [3]     [3]
     3        [1, 2]  [4, 9]
4    3           [1]     [7]
5    3           [1]     [9]
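
A hypothetical tweak to avoid hardcoding the column names: derive the value columns from the frame itself, so the same lambda works for any set of grouping keys.

keys = ['date', 'request']
value_cols = df.columns.difference(keys)  # here: ['count', 'users']
df.groupby(keys).apply(
    lambda x: pd.Series({c: x[c].tolist() for c in value_cols}))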
Zero