1

Imagine i have the following dataframe:

import pandas as pd

df = pd.DataFrame({'col1': ['a','b','c','d','e','f','g','h','i','j','k','l'], 'col2': [1,1,1,2,2,3,3,3,4,5,5,6]})

    col1    col2
0      a       1
1      b       1
2      c       1
3      d       2
4      e       2
5      f       3
6      g       3
7      h       3
8      i       4
9      j       5
10     k       5
11     l       6

if i use this code:

df[df.col2.isin(df.groupby('col2').size().head(3).index)]

i can retrieve the 3 most frecuent categories in col2.

EDIT:

What i would like to do is to filter the dataframe in such a way that only the most frecuent categories of col2 remains in col2. Then i would like to create dummy columns for each category, indicating how many records of the same category there are per category and per letter in col1.

This would be the resulting dataframe:

    col1    col2_1  col2_2  col2_3  rest_count
0      a         1       0       0           0
1      b         1       0       0           0
2      c         1       0       0           0
3      d         0       1       0           0
4      e         0       1       0           0
5      f         0       0       1           0
6      g         0       0       1           0
7      h         0       0       1           0
8      i         0       0       0           1
9      j         0       0       0           1       
10     k         0       0       0           1
11     l         0       0       0           1

How can i store a count of the rest of categories way in a newly created column rest_count??

Thanks in advance

Miguel 2488
  • 1,410
  • 1
  • 20
  • 41

2 Answers2

1
def check_top(row, df_top):
    """create extra mask column called top3
    it will be used to filter out col2 values"""

    if row.col2 in df_top:
        row['top3'] = True
    else:
        row['top3'] = False
    return row

def update_cols(row):
    """update col2 and col3 values depending on top3 value"""

    if row['top3'] == True:
        row['col3'] = None
    else:
        row['col2'] = None
    return row

# get top3 values
df_top = df.groupby('col2').size().head(3).index
df = df.apply(lambda row: check_top(row, df_top), axis=1) 

# create col3 column
df['col3'] = df['col2']

df = df.apply(lambda row: update_cols(row), axis=1)

# select the columns that you need
df = df[['col1', 'col2', 'col3']]
naivepredictor
  • 898
  • 4
  • 14
  • Hi, thank you for your answer, please see the edit of the question. – Miguel 2488 Jan 10 '19 at 10:29
  • Hi, thanks for the edit, your answer is near to what i need, but is missing the dummies, and it produces NaN's in col3 as well as in col 2 for the values tha are in col3 but not in col2. If you could include the dummies and replace th NaN's by zeros it should be perfect. Thank you!! – Miguel 2488 Jan 10 '19 at 10:39
1

Use:

#get top values
v = df.groupby('col2').size().head(3).index
#create new DataFrame by compare each value
df1 = pd.concat([(df.col2 == x).astype(int) for x in v], axis=1)
#create counter for columns names
df1.columns = ['{}_{}'.format(x, i) for i, x in enumerate(df1.columns, 1)]
#join together with original
df = df.join(df1)
#add column for remain values
df['rest_count'] = (~df.col2.isin(v)).astype(int)
print (df)
   col1  col2  col2_1  col2_2  col2_3  rest_count
0     a     1       1       0       0           0
1     b     1       1       0       0           0
2     c     1       1       0       0           0
3     d     2       0       1       0           0
4     e     2       0       1       0           0
5     f     3       0       0       1           0
6     g     3       0       0       1           0
7     h     3       0       0       1           0
8     i     4       0       0       0           1
9     j     5       0       0       0           1
10    k     5       0       0       0           1
11    l     6       0       0       0           1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi Jezrael, thank you very much for your answer. Could you just provide explanations about what your code is doing please?? for the sake of understanding&learning – Miguel 2488 Jan 10 '19 at 10:38
  • 1
    @Miguel2488 - I hope explanation is good, if something not clear let me know. Thanks. – jezrael Jan 10 '19 at 10:40
  • it seems good, i understand everything except for the `create counter for column names` line and i also would like to know what is the `~` character for? Thank you again!! – Miguel 2488 Jan 10 '19 at 10:43
  • 1
    @Miguel2488 - yes, line `df1.columns = ['{}_{}'.format(x, i) for i, x in enumerate(df1.columns, 1)]` is list comprehension with [enumerate](https://stackoverflow.com/a/22171593) and [format](https://realpython.com/python-string-formatting/) for join column name with numbers and `~` is for invert boolena mask `[True, False, True]` for `[False, True, False]`, [link](https://stackoverflow.com/q/15998188) – jezrael Jan 10 '19 at 11:13