I would like to group a DataFrame by partial substrings. This is a sample .csv file:
GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours
What I did so far is importing the file as df = pd.read_csv(sample.csv)
, and then I put all the strings to lowercases with df['Key'] = df['Key'].str.lower()
. The first thing I tried is groupby by GridCode and Key with:
g = df.groupby([df['GridCode'],df['Key']]).size()
then unstack and fill:
d = g.unstack().fillna(0)
and the resulting DataFrame is:
Key behaviour behaviours colors colour colours favourite honours
GridCode
1000 0 0 0 1 1 0 0
1001 1 1 0 0 0 0 0
1002 0 0 0 0 0 1 0
1003 0 0 1 0 0 0 0
1004 0 0 0 0 0 0 1
Now what I would like to do is to group only strings containing the substring 'our', in this case avoiding only the colors Key, creating a new column with the desired substring. The expected result would be like:
Key 'our'
GridCode
1000 2
1001 2
1002 1
1003 0
1004 1
I tried also to mask the DataFrame with masked = df['Key'].str.contains('our')
, then df1 = df[mask]
, but I can't figured out how to make a new column with the new groupby counts. Any help would be really appreciated.