Imagine i have the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': ['a','b','c','d','e','f','g','h','i','j','k','l'], 'col2': [1,1,1,2,2,3,3,3,4,5,5,6]})
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
8 i 4
9 j 5
10 k 5
11 l 6
if i use this code:
df[df.col2.isin(df.groupby('col2').size().head(3).index)]
i can retrieve the 3 most frecuent categories in col2
.
EDIT:
What i would like to do is to filter the dataframe in such a way that only the most frecuent categories of col2
remains in col2
. Then i would like to create dummy columns for each category, indicating how many records of the same category there are per category and per letter in col1
.
This would be the resulting dataframe:
col1 col2_1 col2_2 col2_3 rest_count
0 a 1 0 0 0
1 b 1 0 0 0
2 c 1 0 0 0
3 d 0 1 0 0
4 e 0 1 0 0
5 f 0 0 1 0
6 g 0 0 1 0
7 h 0 0 1 0
8 i 0 0 0 1
9 j 0 0 0 1
10 k 0 0 0 1
11 l 0 0 0 1
How can i store a count of the rest of categories way in a newly created column rest_count
??
Thanks in advance