
Here is what my dataset looks like:

Name | Country
---------------
Alex | USA
Tony | DEU
Alex | GBR
Alex | USA

I am trying to get something like this out, essentially grouping and counting:

Name | Country
---------------
Alex | {USA:2,GBR:1}
Tony | {DEU:1}

It works, but it is slow on LARGE datasets

Here is my code, which works on smaller DataFrames but takes forever on bigger ones (mine is around 14 million rows). I also use the multiprocessing module to speed things up, but it doesn't help much:

from collections import Counter
import pandas as pd

def countNames(x):
    return dict(Counter(x))

def aggregate(df_full, nameList):
    df_list = []
    for q in nameList:
        df = df_full[df_full['Name'] == q]
        df_list.append(df.groupby('Name')['Country']
                         .apply(lambda x: str(countNames(x)))
                         .to_frame().reset_index())
    return pd.concat(df_list)

df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'DEU', 'GBR', 'USA']})[['Name', 'Country']]

aggregate(df, df.Name.unique())

Is there anything that can speed up the internal logic (except for running with multiprocessing)?

ka_boom
  • Yes, don't use a `for` loop. Do you really want to store data as `{USA:2,GBR:1}` e.g. a dictionary in a cell? You lose a lot of Pandas functionality doing that – roganjosh Mar 02 '18 at 23:52
  • Which is fine, since the output needs to be written into a CSV. Any suggestions how to replace the `for` loop? – ka_boom Mar 02 '18 at 23:53

3 Answers


This is essentially a cross tabulation. You said "something like this" which implies that you aren't quite sure what the output should be.

Option 1
Group by and value_counts

df.groupby('Name').Country.value_counts()

Name  Country
Alex  USA        2
      GBR        1
Tony  DEU        1
Name: Country, dtype: int64

To get your specified output:

pd.Series({
    name: pd.value_counts(d).to_dict()
    for name, d in df.groupby('Name').Country
}).rename_axis('Name').reset_index(name='Country')

   Name               Country
0  Alex  {'USA': 2, 'GBR': 1}
1  Tony            {'DEU': 1}

Option 2
However, I'd prefer the representation below. There are a number of ways to produce it, shown in the answer to question #9 in the linked answer.

pd.crosstab(df.Name, df.Country)

Country  DEU  GBR  USA
Name                  
Alex       0    1    2
Tony       1    0    0
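
Since the comments mention the output ultimately needs to be written to a CSV, the cross tabulation can also be collapsed back into the dictionary-string column in a single vectorized pass over rows, with no Python-level loop over names. A sketch (note the dict keys come out in the crosstab's alphabetical column order, not the order of first appearance):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'DEU', 'GBR', 'USA']})

# One row per Name, one column per Country, cells are counts
ct = pd.crosstab(df.Name, df.Country)

# Drop the zero cells in each row and stringify the remaining counts
out = (ct.apply(lambda row: str(row[row > 0].to_dict()), axis=1)
         .rename('Country')
         .reset_index())

#    Name               Country
# 0  Alex  {'GBR': 1, 'USA': 2}
# 1  Tony            {'DEU': 1}
```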
piRSquared

Are you looking for this?

import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Tony', 'Alex', 'Alex'],
                   'Country': ['USA', 'DEU', 'GBR', 'USA']})[['Name', 'Country']]

df = (df.groupby('Name')['Country']
        .apply(lambda x: str(x.value_counts().to_dict()))
        .reset_index(name='Country'))

Returns:

   Name               Country
0  Alex  {'USA': 2, 'GBR': 1}
1  Tony            {'DEU': 1}
Anton vBR

For an O(n) complexity solution, use collections.Counter.

from collections import Counter, defaultdict
import pandas as pd

df = pd.DataFrame({'Name':['Alex','Tony','Alex','Alex'], 
                   'Country':['USA','GBR','USA','DEU']})[['Name','Country']]

c = Counter(map(tuple, df.values))

# Counter({('Alex', 'DEU'): 1, ('Alex', 'USA'): 2, ('Tony', 'GBR'): 1})

Dictionary result

You can then get a Name -> Country dictionary mapping via collections.defaultdict. I would not put dictionaries in a pandas dataframe, it's not designed for this purpose.

tree = lambda: defaultdict(tree)
d = tree()

for k, v in c.items():
    d[k[0]][k[1]] = v

for k, v in d.items():
    print(k, v)

# Alex defaultdict(<function <lambda>>, {'USA': 2, 'DEU': 1})
# Tony defaultdict(<function <lambda>>, {'GBR': 1})

Dataframe result

For display purposes, you can build a dataframe directly from the defaultdict:

res_df = pd.DataFrame.from_dict(d, orient='index').fillna(0)

#       USA  DEU  GBR
# Alex  2.0  1.0  0.0
# Tony  0.0  0.0  1.0
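
Since the comments mention the output ultimately goes to a CSV, the Counter result can also be written out directly with the stdlib csv module, skipping pandas entirely. A minimal sketch using the question's sample rows ('counts.csv' is a placeholder path):

```python
import csv
from collections import Counter, defaultdict

rows = [('Alex', 'USA'), ('Tony', 'DEU'), ('Alex', 'GBR'), ('Alex', 'USA')]
c = Counter(rows)  # counts each (Name, Country) pair in one O(n) pass

# Re-group the flat (Name, Country) counts by Name
d = defaultdict(dict)
for (name, country), count in c.items():
    d[name][country] = count

# One CSV row per Name, with the per-country counts stringified
with open('counts.csv', 'w', newline='') as f:  # 'counts.csv' is a placeholder path
    writer = csv.writer(f)
    writer.writerow(['Name', 'Country'])
    for name, counts in d.items():
        writer.writerow([name, str(counts)])
```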
jpp