
I would like to group a DataFrame by partial substrings. This is a sample .csv file:

GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours

What I did so far is import the file with df = pd.read_csv('sample.csv') and convert all the strings to lowercase with df['Key'] = df['Key'].str.lower(). The first thing I tried was to group by GridCode and Key with:

g = df.groupby([df['GridCode'],df['Key']]).size()

then unstack and fill:

d = g.unstack().fillna(0)

and the resulting DataFrame is:

Key       behaviour  behaviours  colors  colour  colours  favourite  honours
GridCode                                                                    
1000              0           0       0       1        1          0        0
1001              1           1       0       0        0          0        0
1002              0           0       0       0        0          1        0
1003              0           0       1       0        0          0        0
1004              0           0       0       0        0          0        1
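
For reference, the whole pipeline so far, runnable end to end (a minimal sketch, assuming the sample above is saved as sample.csv):

import pandas as pd

# load the sample and normalise the Key strings to lowercase
df = pd.read_csv('sample.csv')
df['Key'] = df['Key'].str.lower()

# count each (GridCode, Key) pair and pivot Key into columns
g = df.groupby([df['GridCode'], df['Key']]).size()
d = g.unstack().fillna(0)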

Now what I would like to do is count only the strings containing the substring 'our', in this case excluding only the colors key, and create a new column named after the desired substring. The expected result would look like:

Key       'our'
GridCode                                                                    
1000        2              
1001        2
1002        1
1003        0
1004        1

I also tried to mask the DataFrame with mask = df['Key'].str.contains('our') and then df1 = df[mask], but I can't figure out how to make a new column with the new groupby counts. Any help would be really appreciated.

Fabio Lamanna

1 Answer

>>> import re  # for the re.IGNORECASE flag
>>> df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()
GridCode
1000        2
1001        2
1002        1
1003        0
1004        1
Name: Key, dtype: float64
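
If you also want the layout shown in the question, with the counts in a column named 'our', one way (a sketch, building on the line above) is to cast to int and convert the Series to a one-column frame:

>>> counts = df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()
>>> counts.astype(int).to_frame(name='our')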

Also, instead of

df.groupby([df['GridCode'],df['Key']])

it is better to do:

df.groupby(['GridCode', 'Key'])
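
which gives the same grouping; for example, the pivot from the question becomes:

>>> df.groupby(['GridCode', 'Key']).size().unstack().fillna(0)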
behzad.nouri
    `.str.contains` has a `case` parameter which seems to do exactly what `re.IGNORECASE` is supposed to do, so this should also work: `df['Key'].str.contains('our', case=False)` – Primer Feb 13 '15 at 13:10
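
Combining that suggestion with the grouped count from the answer gives a version that handles mixed case without importing re (a sketch):

>>> df['Key'].str.contains('our', case=False).groupby(df['GridCode']).sum()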