I have a Pandas dataframe with hundreds of categorical features (encoded as numbers). I want to keep only the top values in each column. I already know that there are only 3 or 4 most frequent values in each column, but I want to select them automatically. I need two ways to do it:

1) Keep only the 3 most frequent values. Note: there are no columns with 1, 2 or 3 unique values (each column has ~20 unique values), so do not consider that case. If there are, for example, several third places, keep them all. For example:

#after you use value_counts() on column 1
1         35
2         23
3         10
4         9
8         8
6         8

#after you use value_counts() on column 2
0         23
2         15
1         15 #two second places
4         9
5         3
6         2

#result after you use value_counts() on column 1
1         35
2         23
3         10
others  25 #9+8+8

#result after you use value_counts() on column 2
0         23
2         15
1         15
4         9
others 5 #3+2

2) Keep as many values in each column as needed so that the combined count of the remaining values is less than the count of the last value you keep. For example:

#after you use value_counts() on column 1
1         35
2         23
3         10
4         3
8         2
6         1

#after you use value_counts() on column 2
0         23
2         15
1         9
4         8
5         3
6         2

#result after you use value_counts() on column 1
1         35
2         23
3         10
others  6 #3+2+1

#result after you use value_counts() on column 2
0         23
2         15
1         9
4         8
others 5 #3+2

Please, do both. Thanks.

outgush
    1. Please provide sample data in a [reproducible way](https://stackoverflow.com/questions/20109391). 2. How do you deal with ties? Say, if there are three third-place features. 3. What to do if there are only two or one feature in a column? – Bill Huang Oct 25 '20 at 12:36
  • You presented a sample containing a **single** column. Note that the decision whether to leave or to drop some row depends on whether value in each column belongs to most frequent values (MFV) in each column separately. So please describe how the expected solution should work if: 1. Value in some column belongs to MFV (looking at values from this column). 2. But values in other column are **not** among MFV (again, looking at values in this (other) column). – Valdi_Bo Oct 25 '20 at 13:21
  • @Valdi_Bo thank you, fixed. I hope this is possible to do it. – outgush Oct 25 '20 at 13:35

3 Answers


Let's try a udf with your logic:

def my_count(s):
    x = s.value_counts()
    if len(x) > 3:
        ret = x.iloc[:3].copy()
        ret.loc['other'] = x.iloc[3:].sum()
    else:
        ret = x
    return ret

df[['col1']].apply(my_count)  # or df.apply(my_count) to process every column

Output:

       col1
1        35
2        23
3        10
other     6
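
Note that `x.iloc[:3]` splits a tie at the third place arbitrarily. A tie-aware sketch of the same udf, keeping every value whose count ties the third-largest distinct count (`my_count_ties` is a hypothetical name, not from the original):

```python
import pandas as pd

def my_count_ties(s):
    # value_counts() is sorted descending; the 3rd-largest *distinct*
    # count is the threshold, so all values tied at third place survive
    x = s.value_counts()
    thr = x.unique()[:3][-1]
    ret = x[x >= thr].copy()
    if (x < thr).any():
        ret.loc['other'] = x[x < thr].sum()
    return ret

# column 2 from the first example in the question
col2 = pd.Series([0]*23 + [2]*15 + [1]*15 + [4]*9 + [5]*3 + [6]*2)
print(my_count_ties(col2))
```

On this column the two second places (counts 15 and 15) push the threshold down to 9, so four values are kept and `other` collects 3+2 = 5, as in the question's expected output.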
Quang Hoang

I will showcase what I'd like to use myself at work, on a two-column dataset. Limitation: simultaneous ties in the 2nd, 3rd, and 4th places are not collected into the same cell in this solution. You may have to customize this behavior further depending on your purpose.

Sample Data

There are 2 columns with 26 classes each. One column is categorical and the other one is numeric. The sample data is chosen deliberately to showcase the effect of ties.

import pandas as pd
import numpy as np

np.random.seed(2)  # reproducibility
df = pd.DataFrame(np.random.randint(65, 91, (1000, 2)), columns=["str", "num"])
df["str"] = list(map(chr, df["str"].values))

print(df)
    str  num
0     I   80
1     N   73
2     W   76
3     S   76
4     I   72
..   ..  ...
995   M   80
996   Q   70
997   P   66
998   I   87
999   F   83
[1000 rows x 2 columns]

Desired Function

def count_top_n(df, n_top):

    # name of output columns
    def gen_cols(ls_str):
        for s in ls_str:
            yield s
            yield f"{s}_counts"

    df_count = pd.DataFrame(np.zeros((n_top+1, df.shape[1]*2), dtype=object),
                            index=range(1, n_top+2),
                            columns=list(gen_cols(df.columns.values)))  # df.shape[1] = #cols
    # process each column
    for i, col in enumerate(df):
        # count
        tmp = df[col].value_counts()
        assert len(tmp) > n_top, f"too few classes: {len(tmp)} <= {n_top} = n_top"

        # case 1: no tie at the n_top-th place
        if tmp.iat[n_top - 1] != tmp.iat[n_top]:
            # fill in classes
            df_count.iloc[:n_top, 2*i] = tmp[:n_top].index.values
            df_count.iloc[n_top, 2*i] = "(rest)"
            # fill counts
            df_count.iloc[:n_top, 2*i+1] = tmp[:n_top]
            df_count.iloc[n_top, 2*i+1] = tmp[n_top:].sum()
        
        # case 2: ties
        else:
            # new termination location
            n_top_new = (tmp >= tmp.iat[n_top]).sum()
            # fill in classes
            df_count.iloc[:n_top-1, 2*i] = tmp.iloc[:n_top-1].index.values
            df_count.iloc[n_top-1, 2*i] = list(tmp.iloc[n_top-1:n_top_new].index.values)
            df_count.iloc[n_top, 2*i] = "(rest)"
            # fill counts
            df_count.iloc[:n_top-1, 2*i+1] = tmp.iloc[:n_top-1].values
            df_count.iloc[n_top-1, 2*i+1] = list(tmp.iloc[n_top-1:n_top_new].values)
            df_count.iloc[n_top, 2*i+1] = tmp.iloc[n_top_new:].values.sum()

    return df_count

Output:

A human-readable table is generated. Note that there are ties in the 2nd, 3rd, and 4th places for column str.

print(count_top_n(df, 3))
      str str_counts       num num_counts
1       V        52        71        51
2       Q        46        86        47
3  [B, K]  [46, 46]  [90, 67]  [46, 46]
4  (rest)       810    (rest)       810
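
If the goal is to transform the dataframe itself (replace everything but the top values with an "others" label) rather than to summarize counts, a minimal sketch along the same tie-aware lines — `relabel_top_n` is a hypothetical helper, not part of this answer:

```python
import pandas as pd

def relabel_top_n(df, n=3, other="others"):
    # keep every value whose count reaches the n-th largest distinct
    # count, relabel the rest as `other`
    out = df.copy()
    for col in out:
        counts = out[col].value_counts()
        thr = counts.unique()[:n][-1]       # n-th largest distinct count
        keep = counts[counts >= thr].index
        out[col] = out[col].astype(object)  # allow the string label
        out.loc[~out[col].isin(keep), col] = other
    return out

# column 2 from the first example in the question
data = [0]*23 + [2]*15 + [1]*15 + [4]*9 + [5]*3 + [6]*2
res = relabel_top_n(pd.DataFrame({"col2": data}))
print(res["col2"].value_counts())
```

Unlike the summary tables above, this returns a dataframe of the original shape, which is what you want if the columns feed into a model afterwards.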
Bill Huang

Use the following function:

def myFilter(col, maxOther = 0):
    unq = col.value_counts()
    if maxOther == 0:    # Return 3 MFV
        thr = unq.unique()[:3][-1]
        otherCnt = unq[unq < thr].sum()
        rv = col[col.isin(unq[unq >= thr].index)]
    else:    # Drop least frequent values, keep 'other' total below maxOther
        otherCnt = 0
        nDrop = 0    # number of least frequent values to drop
        for i in unq[::-1]:
            if otherCnt + i >= maxOther: break
            otherCnt += i
            nDrop += 1
        thrInd = unq.size - nDrop
        rv = col[col.isin(unq[:thrInd].index)]
    rv = rv.reset_index(drop=True)
    # print(f'  Trace {col.name}\nunq:\n{unq}\notherCnt: {otherCnt}')
    return rv

The distinction between the two variants:

  • return the 3 most frequent values (MFV),
  • drop the least frequent (other) values

is controlled by the maxOther parameter. Its default value of 0 selects the first variant.

So to test both variants, call:

  • df.apply(myFilter) for the first variant,
  • df.apply(myFilter, maxOther=10) for the second variant.

To see the trace printout, uncomment the print statement in the function.
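
The maxOther branch can be checked against the question's own data with a self-contained sketch (`fold_tail` is a hypothetical stand-in that mirrors the loop above):

```python
import pandas as pd

def fold_tail(col, max_other=10):
    # drop values from the least frequent end while the running
    # total of dropped rows stays below max_other
    counts = col.value_counts()
    dropped, n_drop = 0, 0
    for c in counts[::-1]:
        if dropped + c >= max_other:
            break
        dropped += c
        n_drop += 1
    kept = counts.index[:len(counts) - n_drop]
    return col[col.isin(kept)].reset_index(drop=True)

# column 1 from the second example in the question
col = pd.Series([1]*35 + [2]*23 + [3]*10 + [4]*3 + [8]*2 + [6]*1, name="col1")
print(fold_tail(col).value_counts())
```

With max_other=10, the tail counts 1, 2 and 3 (total 6) are dropped and the values 1, 2 and 3 survive, matching the expected "others 6" result.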

Valdi_Bo