
I have data

   member_id application_name  active_seconds
0     192180            Opera               6
1     192180            Opera               7
2     192180           Chrome             243
3    5433112           Chrome              52
4    5433112            Opera              34
5    5433112           Chrome             465

I need to group by `member_id` and `application_name`, getting both the count of rows and the sum of `active_seconds` per group.

I use `print df.groupby(['member_id', 'application_name']).count()`, but that only gives me a count for `active_seconds`, and

print df.groupby(['member_id', 'application_name'])['active_seconds'].count() 

works incorrectly. What am I doing wrong?


1 Answer


I think you need to aggregate with `agg`:

df1 = (df.groupby(['member_id', 'application_name'])
         .agg({'application_name': len, 'active_seconds': sum}))

print (df1)
                            active_seconds  application_name
member_id application_name                                  
192180    Chrome                       243                 1
          Opera                         13                 2
5433112   Chrome                       517                 2
          Opera                         34                 1

If you need `reset_index`, first rename the columns (otherwise you get `ValueError: cannot insert application_name, already exists`):

df1 = (df.groupby(['member_id', 'application_name'])
         .agg({'application_name': len, 'active_seconds': sum})
         .rename(columns={'active_seconds': 'count_sec', 'application_name': 'sum_app'})
         .reset_index())

print (df1)
   member_id application_name  count_sec  sum_app
0     192180           Chrome        243        1
1     192180            Opera         13        2
2    5433112           Chrome        517        2
3    5433112            Opera         34        1
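As a side note (not part of the original answer): in newer pandas versions (0.25+), named aggregation produces the renamed columns directly, so the separate `rename` step is unnecessary. A minimal sketch, keeping the same output column names as above:

```python
import pandas as pd

df = pd.DataFrame({'member_id': [192180, 192180, 192180, 5433112, 5433112, 5433112],
                   'application_name': ['Opera', 'Opera', 'Chrome', 'Chrome', 'Opera', 'Chrome'],
                   'active_seconds': [6, 7, 243, 52, 34, 465]})

# Named aggregation: each output column is a (source column, function) pair,
# so no rename step is needed and the duplicate-column ValueError cannot occur
df1 = (df.groupby(['member_id', 'application_name'])
         .agg(sum_app=('active_seconds', 'size'),   # row count per group
              count_sec=('active_seconds', 'sum'))  # total seconds per group
         .reset_index())
print(df1)
```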

Timings:

In [208]: %timeit df.groupby(['member_id', 'application_name']).agg({'application_name':len, 'active_seconds':sum}).rename(columns={'active_seconds':'count_sec','application_name':'sum_app'}).reset_index()
10 loops, best of 3: 93.6 ms per loop

In [209]: %timeit (f1(df))
10 loops, best of 3: 127 ms per loop

Code for testing:

import pandas as pd

df = pd.DataFrame({'member_id': {0: 192180, 1: 192180, 2: 192180, 3: 5433112, 4: 5433112, 5: 5433112}, 
                   'active_seconds': {0: 6, 1: 7, 2: 243, 3: 52, 4: 34, 5: 465}, 
                   'application_name': {0: 'Opera', 1: 'Opera', 2: 'Chrome', 3: 'Chrome', 4: 'Opera', 5: 'Chrome'}})
print (df)
#   active_seconds application_name  member_id
#0               6            Opera     192180
#1               7            Opera     192180
#2             243           Chrome     192180
#3              52           Chrome    5433112
#4              34            Opera    5433112
#5             465           Chrome    5433112

df = pd.concat([df]*1000).reset_index(drop=True)
print (len(df))
#6000

df1 = df.groupby(['member_id', 'application_name']).agg({'application_name':len, 'active_seconds':sum}).rename(columns={'active_seconds':'count_sec','application_name':'sum_app'}).reset_index() 
print (df1)

def f1(df):
    a = (df.groupby(['member_id', 'application_name'])['active_seconds'].sum() )
    b = (df.groupby(['member_id', 'application_name']).size())
    return (pd.concat([a,b], axis=1, keys=['count_sec','sum_app']).reset_index())

print (f1(df))
#   member_id application_name  count_sec  sum_app
#0     192180           Chrome     243000     1000
#1     192180            Opera      13000     2000
#2    5433112           Chrome     517000     2000
#3    5433112            Opera      34000     1000
#   member_id application_name  count_sec  sum_app
#0     192180           Chrome     243000     1000
#1     192180            Opera      13000     2000
#2    5433112           Chrome     517000     2000
#3    5433112            Opera      34000     1000
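For much larger inputs (the comments below mention millions of rows), one possible memory optimisation - an assumption on my part, not something the answer tested - is to store the repeated application names as a categorical dtype before grouping:

```python
import pandas as pd

df = pd.DataFrame({'member_id': [192180, 192180, 192180, 5433112, 5433112, 5433112],
                   'application_name': ['Opera', 'Opera', 'Chrome', 'Chrome', 'Opera', 'Chrome'],
                   'active_seconds': [6, 7, 243, 52, 34, 465]})

# Repeated strings are stored once as categories, which can cut memory
# usage substantially on millions of rows
df['application_name'] = df['application_name'].astype('category')

# observed=True keeps only the category combinations that actually occur
res = (df.groupby(['member_id', 'application_name'], observed=True)['active_seconds']
         .agg(['count', 'sum']))
print(res)
```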
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • But how can I add either column to one df? I want to show id, application_name, the count of visits and then the quantity of active_seconds. – NineWasps Jun 17 '16 at 10:27
  • If I have 10 million rows, how long should it take? It doesn't print a result after 5 minutes. – NineWasps Jun 17 '16 at 10:59
  • 10 million? That is a lot. What is the size of your `RAM`? – jezrael Jun 17 '16 at 11:03
  • I checked, there are 4625465 rows, RAM is 8Gb – NineWasps Jun 17 '16 at 11:05
  • It works with `groupby`, `print df.groupby(['member_id', 'application_name'])['active_seconds'].sum()` and `print df.groupby(['member_id', 'application_name']).count()`, but the results are the same for the big file. On the training data posted in this question it works correctly. – NineWasps Jun 17 '16 at 11:08
  • It works better because [cython-optimized aggregation functions](http://pandas.pydata.org/pandas-docs/stable/groupby.html#cython-optimized-aggregation-functions) are faster than `agg`. I will try some tests to see if another solution is faster. – jezrael Jun 17 '16 at 11:12
  • Thank you, I will wait. Can you explain why `groupby` can work incorrectly? – NineWasps Jun 17 '16 at 11:15
  • What do you mean by incorrectly? Do you mean the `count` function – [see](http://stackoverflow.com/a/33346694/2901002)? – jezrael Jun 17 '16 at 11:17
  • And `len` is equivalent to `size`. – jezrael Jun 17 '16 at 11:19
  • I edited the answer, but `agg` is faster than its equivalent functions. – jezrael Jun 17 '16 at 11:23
  • On my PC it works, and I have 8GB RAM too. Maybe a faster processor? But it is 600k rows. Maybe delete one or two `0` in `df = pd.concat([df]*100000).reset_index(drop=True)`. – jezrael Jun 17 '16 at 11:28
  • It returns `Memory Error` when `df = pd.concat([df]*100).reset_index(drop=True)` – NineWasps Jun 17 '16 at 12:11
  • Sorry, I found the problem - in my timing I used your small sample - `df = pd.DataFrame({'member_id': {0: 192180, 1: 192180, 2: 192180, 3: 5433112, 4: 5433112, 5: 5433112}, 'active_seconds': {0: 6, 1: 7, 2: 243, 3: 52, 4: 34, 5: 465}, 'application_name': {0: 'Opera', 1: 'Opera', 2: 'Chrome', 3: 'Chrome', 4: 'Opera', 5: 'Chrome'}})` - so I used `df = pd.concat([df]*100).reset_index(drop=True)` for an extended sample. Try it, I think it will work now. – jezrael Jun 17 '16 at 12:14
  • With 100 it returns `Memory Error`; with 10 it has already been running for 30 minutes and returns nothing. – NineWasps Jun 17 '16 at 13:19
  • Hmmm, that is only for testing - I edited the answer to show how I test. – jezrael Jun 17 '16 at 13:25
  • Ok, first run the `code for testing` - now there are `6000` rows in `df`. Then in the ipython console run `%timeit df.groupby(['member_id', 'application_name']).agg({'application_name':len, 'active_seconds':sum}).rename(columns={'active_seconds':'count_sec','application_name':'sum_app'}).reset_index()` and `%timeit (f1(df))`. How does it work? – jezrael Jun 17 '16 at 13:27
  • Once I had a similar problem, and what worked for me was closing unnecessary programs on the `PC` and restarting `Anaconda` (or pycharm, ...). – jezrael Jun 17 '16 at 13:44
  • On my data it returns an incorrect result: it should return `243` and `13` for id `192180`, but it returns `243` and `67`. Instead of computing `6+7` it joins the values. – NineWasps Jun 17 '16 at 14:14
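The last comment (getting `67` instead of `13`) suggests `active_seconds` is stored as strings, so `sum` concatenates `'6' + '7'` instead of adding. A possible fix, assuming that is the cause, is to convert the column to numeric before grouping:

```python
import pandas as pd

df = pd.DataFrame({'member_id': [192180, 192180],
                   'application_name': ['Opera', 'Opera'],
                   'active_seconds': ['6', '7']})   # strings, as the comment suggests

print(df.groupby(['member_id', 'application_name'])['active_seconds'].sum())
# with strings, sum concatenates: '67'

# errors='coerce' turns unparseable entries into NaN instead of raising
df['active_seconds'] = pd.to_numeric(df['active_seconds'], errors='coerce')
print(df.groupby(['member_id', 'application_name'])['active_seconds'].sum())
# after conversion: 13
```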