
I'm trying to figure out how to count the number of rows per unique pair of columns (ip, useragent), e.g.

d = pd.DataFrame({'ip': ['192.168.0.1', '192.168.0.1', '192.168.0.1', '192.168.0.2'], 'useragent': ['a', 'a', 'b', 'b']})

     ip              useragent
0    192.168.0.1     a
1    192.168.0.1     a
2    192.168.0.1     b
3    192.168.0.2     b

To produce:

ip           useragent  
192.168.0.1  a           2
192.168.0.1  b           1
192.168.0.2  b           1

Ideas?

JJJ
barnybug

2 Answers


Use groupby on both columns and take the size of each group:

d.groupby(['ip', 'useragent']).size()

produces:

ip          useragent               
192.168.0.1 a           2
            b           1
192.168.0.2 b           1
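If you'd rather have the counts as a regular column in a flat DataFrame instead of a MultiIndex Series, pass a `name` to `reset_index` (a small sketch; the column name `count` is my choice):

```python
import pandas as pd

d = pd.DataFrame({'ip': ['192.168.0.1', '192.168.0.1', '192.168.0.1', '192.168.0.2'],
                  'useragent': ['a', 'a', 'b', 'b']})

# size() returns a Series with a MultiIndex of (ip, useragent) pairs;
# reset_index(name=...) flattens it into a DataFrame with a named counts column
counts = d.groupby(['ip', 'useragent']).size().reset_index(name='count')
print(counts)
#             ip useragent  count
# 0  192.168.0.1         a      2
# 1  192.168.0.1         b      1
# 2  192.168.0.2         b      1
```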
Matti John
  • Great, thanks. Missed the final step on the groupby/count/[]. – barnybug Dec 01 '12 at 22:31
  • 2
    For me, that just gives `AttributeError: 'DataFrame' object has no attribute 'size'`. – Anaphory Mar 11 '15 at 11:02
  • 1
    This does not work for me. I get ` MultiIndex: 0 entries Empty DataFrame` Pandas Version 15.2 – feinmann Aug 19 '15 at 15:17
  • 9
    Got it: `d.groupby(['ip', 'useragent']).size()` does it :) – feinmann Aug 19 '15 at 15:23
  • 1
    What if I want to count the number of unique values in each range? i.e.: in range "192.28.0.1" there are 2 unique values (a,b). In range "192.168.0.2", there is 1 unique value which is b. How could I write my code to get it? – weefwefwqg3 Feb 20 '17 at 03:01
  • to answer weefwefwqg3: `d.groupby(['ip'])['useragent'].nunique()` – Sander van den Oord Jun 22 '17 at 12:00
  • For me `print(d.groupby(['ip', 'useragent']).count())` prints `Empty DataFrame Columns: [] Index: [(192.168.0.1, a), (192.168.0.1, b), (192.168.0.2, b)]` – Youda008 Oct 21 '18 at 16:14
  • @Youda008 this is quite an old answer, it should be `.size()` now instead of `.count()` – Matti John Oct 22 '18 at 08:19
  • Well `size` just produces a Series whose index combines the columns sent to groupby, which isn't very practical. What we need is a DataFrame whose first columns are the columns from the groupby argument and whose last column is the count of occurrences of these pairs, triples, ... – Youda008 Oct 22 '18 at 09:01
  • @Youda008 `d.groupby(['ip', 'useragent']).size().reset_index()` will produce a dataframe – Matti John Oct 22 '18 at 10:30
  • can you make `useragent` the column? – Albert Chen Jan 21 '20 at 17:11
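As the comment above suggests, counting the distinct useragents per ip (rather than rows per pair) is a `nunique` on the grouped column (a small sketch):

```python
import pandas as pd

d = pd.DataFrame({'ip': ['192.168.0.1', '192.168.0.1', '192.168.0.1', '192.168.0.2'],
                  'useragent': ['a', 'a', 'b', 'b']})

# number of distinct useragent values observed for each ip
print(d.groupby('ip')['useragent'].nunique())
# ip
# 192.168.0.1    2
# 192.168.0.2    1
```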
print(d.groupby(['ip', 'useragent']).size().reset_index().rename(columns={0:''}))

gives:

            ip useragent   
0  192.168.0.1         a  2
1  192.168.0.1         b  1
2  192.168.0.2         b  1

Another nice option might be pandas.crosstab:

print(pd.crosstab(d.ip, d.useragent))
print('\nsome cosmetics:')
print(pd.crosstab(d.ip, d.useragent).reset_index().rename_axis('', axis='columns'))

gives:

useragent    a  b
ip               
192.168.0.1  2  1
192.168.0.2  0  1

some cosmetics:
            ip  a  b
0  192.168.0.1  2  1
1  192.168.0.2  0  1
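In newer pandas (1.1 and later), `DataFrame.value_counts` gives the same pair counts as a one-liner (a sketch, assuming a recent pandas version; results are sorted by count descending by default):

```python
import pandas as pd

d = pd.DataFrame({'ip': ['192.168.0.1', '192.168.0.1', '192.168.0.1', '192.168.0.2'],
                  'useragent': ['a', 'a', 'b', 'b']})

# count each unique (ip, useragent) pair; returns a Series with a MultiIndex
print(d.value_counts(['ip', 'useragent']))
```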
Markus Dutschke
  • I second using groupby. I ran a test and crosstab is much slower than groupby (e.g. in my test groupby took 0.02 seconds whereas crosstab took 2.3 seconds for a dataframe of 10K records). – pegah May 26 '20 at 16:19
  • @pegah: I am very surprised! You can simply call the commands with the `%time` magic command in front (using ipython). I get runtimes of 16.5ms vs 18.4ms. In general, the more specific function (here `crosstab`) can be expected to perform better. Especially a runtime difference of a factor of 100 for similar calls in the same library should make one suspicious. Did you maybe calculate `crosstab` on a full `df` and select some columns after that, while you calculated the group by only on the selected columns? – Markus Dutschke May 27 '20 at 05:54