
I would like to group a DataFrame by partial substrings. This is a sample .csv file:

GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours

What I did so far is import the file with df = pd.read_csv('sample.csv') and convert all the strings to lowercase with df['Key'] = df['Key'].str.lower(). The first thing I tried was to group by GridCode and Key with:

g = df.groupby([df['GridCode'],df['Key']]).size()

then unstack and fill:

d = g.unstack().fillna(0)

and the resulting DataFrame is:

Key       behaviour  behaviours  colors  colour  colours  favourite  honours
GridCode                                                                    
1000              0           0       0       1        1          0        0
1001              1           1       0       0        0          0        0
1002              0           0       0       0        0          1        0
1003              0           0       1       0        0          0        0
1004              0           0       0       0        0          0        1
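
For reference, the whole pipeline so far, runnable end to end (a minimal sketch, assuming the sample above is saved as sample.csv):

import pandas as pd

# load the sample and normalise the Key strings to lowercase
df = pd.read_csv('sample.csv')
df['Key'] = df['Key'].str.lower()

# count each (GridCode, Key) pair and pivot Key into columns
g = df.groupby([df['GridCode'], df['Key']]).size()
d = g.unstack().fillna(0)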

Now what I would like to do is count only the strings containing the substring 'our', in this case excluding only the colors key, and create a new column named after the desired substring. The expected result would look like:

Key       'our'
GridCode                                                                    
1000        2              
1001        2
1002        1
1003        0
1004        1

I also tried to mask the DataFrame with mask = df['Key'].str.contains('our') and then df1 = df[mask], but I can't figure out how to make a new column with the new groupby counts. Any help would be really appreciated.

Fabio Lamanna

1 Answer

>>> import re  # for the re.IGNORECASE flag
>>> df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()
GridCode
1000        2
1001        2
1002        1
1003        0
1004        1
Name: Key, dtype: float64
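
If you also want the layout shown in the question, with the counts in a column named 'our', one way (a sketch, building on the line above) is to cast to int and convert the Series to a one-column frame:

>>> counts = df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()
>>> counts.astype(int).to_frame(name='our')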

Also, instead of

df.groupby([df['GridCode'],df['Key']])

it is better to do:

df.groupby(['GridCode', 'Key'])
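
which gives the same grouping; for example, the pivot from the question becomes:

>>> df.groupby(['GridCode', 'Key']).size().unstack().fillna(0)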
behzad.nouri
    `.str.contains` has a `case` parameter which seems to do exactly what `re.IGNORECASE` is supposed to do, so this should also work: `df['Key'].str.contains('our', case=False)` – Primer Feb 13 '15 at 13:10
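
Combining that suggestion with the grouped count from the answer gives a version that handles mixed case without importing re (a sketch):

>>> df['Key'].str.contains('our', case=False).groupby(df['GridCode']).sum()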