
The sort option in groupby does not work the way I thought it would. In the following example, I do not want all the "USA" rows grouped together, because a "Russia" row sits between them.

from io import StringIO

import pandas as pd

myst = """india, 905034 , 19:44
USA, 905094  , 19:33
Russia,  905154 ,   21:56
USA, 345345, 45:55
USA, 34535, 65:45
"""
u_cols = ['country', 'index', 'current_tm']

# Read the inline CSV text; the data has no header row, so column names are supplied.
df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)

When I use groupby I get the following:

df.groupby('country', sort=False).size()

country
india     1
USA       3
Russia    1
dtype: int64

Is there any way I can get results like this?

country
india     1
USA       1
Russia    1
USA       2
shantanuo
Yes, pandas groupby does not work like itertools.groupby. You can try the trick [here](https://stackoverflow.com/questions/40802800/pandas-dataframe-how-to-groupby-consecutive-values) – ayhan May 26 '18 at 11:20
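
For comparison, here is a minimal sketch of the itertools behaviour that comment refers to. itertools.groupby only merges adjacent equal items, so the two runs of "USA" stay separate. The list `countries` is just the country column from the question typed out by hand, and `runs` is a name chosen for illustration.

```python
from itertools import groupby

# Country column from the question, in its original row order.
countries = ["india", "USA", "Russia", "USA", "USA"]

# groupby yields one (key, iterator) pair per run of adjacent equal values.
runs = [(name, len(list(group))) for name, group in groupby(countries)]

print(runs)  # [('india', 1), ('USA', 1), ('Russia', 1), ('USA', 2)]
```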

2 Answers


You could try this bit of code instead of a direct groupby:

country = []  # name of the country in each consecutive run
count = []    # number of rows in each consecutive run

# The grouping key increases by 1 every time the country differs from the previous row,
# so each run of identical consecutive countries becomes its own group.
for i, g in df.groupby((df.country != df.country.shift()).cumsum()):
    country.append(g.country.tolist()[0])   # first item is the run's country name
    count.append(len(g.country.tolist()))   # list length is the run's row count

# Combine the two lists into a dataframe.
pd.DataFrame(data={'country': country, 'count': count})

The expression (df.country != df.country.shift()).cumsum() produces a Series whose value increases by 1 at every change of country in the country column. Grouping by that Series puts each run of consecutive identical countries into its own group.
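
To make that concrete, here is a small sketch that builds the key one step at a time. It uses only the country column from the question; the names `changed` and `run_id` are chosen for illustration and are not part of the code above.

```python
import pandas as pd

country = pd.Series(["india", "USA", "Russia", "USA", "USA"], name="country")

# True wherever a row differs from the row before it.
# The first row is compared with a missing value, so it counts as a change.
changed = country != country.shift()

# A running total of the changes gives each consecutive run its own number.
run_id = changed.cumsum()

print(pd.DataFrame({"country": country, "changed": changed, "run_id": run_id}))
```

The last two rows share the same run_id because the country does not change between them.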

In the for loop, i represents the unique cumulative number assigned to each country appearance and g represents the corresponding full row(s) from your original dataframe.

For your given data, g.country.tolist() returns the following list of country names for each group (each value of i):

['india']
['USA']
['Russia']
['USA', 'USA']

Therefore, the first item of each list is the name of the country and the length of the list is the number of consecutive appearances. Collecting both values in lists and putting them together into a dataframe gives the output you require.

You could also use list comprehensions rather than the for loop:

# GroupBy object with one group per run of consecutive identical countries.
runs = df.groupby((df.country != df.country.shift()).cumsum())
country = [g.country.tolist()[0] for i, g in runs]   # country name of each run
count = [len(g.country.tolist()) for i, g in runs]   # number of rows in each run

Reference: Pandas DataFrame: How to groupby consecutive values
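
If you would rather avoid Python-level iteration altogether, the same key can be combined with named aggregation. This is only a sketch of an alternative, not the approach shown above: the names `run_id` and `result` are made up for illustration, and the frame is rebuilt from the country column alone.

```python
import pandas as pd

df = pd.DataFrame({"country": ["india", "USA", "Russia", "USA", "USA"]})

# Same consecutive-run key as before, renamed so it does not clash with the column.
run_id = (df.country != df.country.shift()).cumsum().rename("run_id")

# For each run, take its first country name and its number of rows.
result = (
    df.groupby(run_id)["country"]
      .agg(country="first", count="size")
      .reset_index(drop=True)
)

print(result)
```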

Tane

Using the trick given in @user2285236's comment:

df['Group'] = (df.country != df.country.shift()).cumsum()
df.groupby(['country', 'Group'], sort=False).size()
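
As a sketch of what this gives on the question's data: the result is indexed by both country and Group. If only the country labels are wanted, as in the desired output, dropping the helper level is one option. The names `sizes` and `flat` are illustrative, and the frame is rebuilt from the country column alone.

```python
import pandas as pd

df = pd.DataFrame({"country": ["india", "USA", "Russia", "USA", "USA"]})

df["Group"] = (df.country != df.country.shift()).cumsum()
sizes = df.groupby(["country", "Group"], sort=False).size()

# Optional: remove the helper level so only the country labels remain.
flat = sizes.droplevel("Group")

print(flat)
```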
phi