
I am using pandas and numpy.

I want to remove every column in my 9000 x 13 training DataFrame where at least 20% of the entries take the value -200. In this case, -200 acts like a missing value or NaN, so I am removing variables that aren't useful. I have a sample of the data below. Any help would be appreciated. This is my attempt so far, although it checks for actual NaNs rather than for -200: train_mod = train.loc[:, train.isnull().mean() < .2]
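Since -200 is a sentinel value rather than a true NaN, `isnull()` never fires on it. A minimal sketch of two ways to adapt the attempt above, using made-up sample values (the real frame has 13 columns):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 9000 x 13 training frame
train = pd.DataFrame({
    'A': [0.5, 1.5, 2.0, 1.6, 4.7],
    'B': [846.25, 1016.25, -200.0, 950.5, 1469.75],
    'C': [-200, -200, -200, -200, -200],
})

# Option 1: convert the -200 sentinel to NaN first, then the NaN-based attempt works
train_mod = train.replace(-200, np.nan).loc[:, lambda d: d.isnull().mean() < 0.2]

# Option 2: test for the sentinel directly, keeping columns with < 20% of -200 values
train_mod2 = train.loc[:, (train == -200).mean() < 0.2]
```

Here column B is exactly 20% sentinel and C is entirely sentinel, so both are dropped and only A survives.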

A        B        C            D      E                 F          \
5723     0.5       846.25      -200    2.619270         627.50     79.0   
4014     1.5      1016.25      -200    6.810175         848.50     99.0   
4074     2.0      -200.00      -200    -200.000        -200.00    114.0   
4577     1.6       950.50      -200    8.649763         925.50    351.0   
6691     4.7      1469.75      -200   25.820425        1449.75    677.0   
2889     0.5       902.50      -200    2.676091         631.25   -200.0   
4387     2.0      1095.75      -200   12.972673        1082.75    310.0   
4289     1.0       885.50      -200    2.695146         632.50   -200.0   
2887     2.3      1355.00      -200   16.611225        1198.25    129.0   
5694     1.1       936.25      -200    6.821513         849.00    127.0   
luigigi
  • Why do you want to do that? Do you want to reduce the memory size? – Shrey Oct 24 '19 at 12:13
  • Possible duplicate of [How to delete a column in pandas dataframe based on a condition?](https://stackoverflow.com/questions/31614804/how-to-delete-a-column-in-pandas-dataframe-based-on-a-condition) – Florian Bernard Oct 24 '19 at 12:22

3 Answers

val = -200
rows_20 = df.shape[0] // 5

train_mod = train[[c for c in train.columns if (train[c] == val).sum() <= rows_20]]

Explanation:

train[c] == val returns a Series of booleans (True/False). Calling .sum() on this counts the number of True values, which is then checked against rows_20, i.e. 20% of the number of rows in the DataFrame.

The list comprehension returns only columns that match the if condition.
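A quick check of this approach on a toy frame (column names invented for illustration):

```python
import pandas as pd

train = pd.DataFrame({
    'good': [1, 2, 3, 4, 5],
    'bad':  [-200, -200, 7, 8, 9],
})

val = -200
rows_20 = train.shape[0] // 5   # 20% of 5 rows = 1

# 'bad' has two -200 entries (2 > 1), so only 'good' survives
train_mod = train[[c for c in train.columns if (train[c] == val).sum() <= rows_20]]
```

Note that `//` truncates and `<=` makes the threshold inclusive, so a column with exactly 20% sentinel values is kept; since the question asks to drop columns with *at least* 20%, a strict `< rows_20` may be what you want.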

Dan
  • Code-only answers are generally frowned upon on this site. Could you please edit your answer to include some comments or explanation of your code? Explanations should answer questions like: What does it do? How does it do it? Where does it go? How does it solve OP's problem? See: [How to answer](https://stackoverflow.com/help/how-to-answer). Thanks! – Eduardo Baitello Oct 24 '19 at 14:44
  • @EduardoBaitello fair enough, although that link doesn't mention anything about code only answers – Dan Oct 24 '19 at 14:47

You may try creating a new tagging column using numpy.where(). Then use groupby with transform('count') to add a column counting each tag, and finally compute the ratio of tagged rows. If the ratio is >= 20%, drop all rows tagged as -200 or lower.

Consider:

>>> df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9], 'val':[100,200,-250,2000,20312039,12485,-300,-350,-60494]})
>>> df
   id       val
0   1       100
1   2       200
2   3      -250
3   4      2000
4   5  20312039
5   6     12485
6   7      -300
7   8      -350
8   9    -60494

>>> df['Check Negative 200'] = np.where(df['val'] <=-200, ['Negative 200 or lower'], ['Greater than -200'])
>>> df
   id       val     Check Negative 200
0   1       100      Greater than -200
1   2       200      Greater than -200
2   3      -250  Negative 200 or lower
3   4      2000      Greater than -200
4   5  20312039      Greater than -200
5   6     12485      Greater than -200
6   7      -300  Negative 200 or lower
7   8      -350  Negative 200 or lower
8   9    -60494  Negative 200 or lower

>>> df['Count'] = df.groupby('Check Negative 200')['Check Negative 200'].transform('count')
>>> df
   id       val     Check Negative 200  Count
0   1       100      Greater than -200      5
1   2       200      Greater than -200      5
2   3      -250  Negative 200 or lower      4
3   4      2000      Greater than -200      5
4   5  20312039      Greater than -200      5
5   6     12485      Greater than -200      5
6   7      -300  Negative 200 or lower      4
7   8      -350  Negative 200 or lower      4
8   9    -60494  Negative 200 or lower      4

>>> dd = dict(df['Check Negative 200'].value_counts())
>>> dd
{'Greater than -200': 5, 'Negative 200 or lower': 4}

if dd['Negative 200 or lower']/len(df) > .2:
    df = df[df['Check Negative 200'].isin(['Greater than -200'])]
else:
    pass

>>> df
   id       val Check Negative 200  Count
0   1       100  Greater than -200      5
1   2       200  Greater than -200      5
3   4      2000  Greater than -200      5
4   5  20312039  Greater than -200      5
5   6     12485  Greater than -200      5

You can also drop the added columns so your columns would remain the same as the input.

>>> del df['Check Negative 200']
>>> del df['Count']
>>> df.reset_index(inplace = True, drop = True)
>>> df
   id       val
0   1       100
1   2       200
2   4      2000
3   5  20312039
4   6     12485
Joe

This is a somewhat convoluted one-liner that should do the trick:

df_mod = df[df.columns[(df == -200).sum()/df.shape[0] < 0.2]]

Essentially, going from the inside out, we're seeing how many values in each column are invalid (df == -200). The sum over that bracket sums over boolean values, so it counts the True entries. We divide this by the number of rows, and where this value is less than 20%, the columns are kept. We then take the DataFrame where the columns fulfill this condition, and are left with the columns where less than 20% of entries are invalid.
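As a sanity check, here is the one-liner run on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'keep': [1.0, 2.0, 3.0, 4.0, 5.0],
    'drop': [-200.0, -200.0, 3.0, 4.0, 5.0],
})

# 'drop' is 40% invalid (>= 20%), so only 'keep' remains
df_mod = df[df.columns[(df == -200).sum() / df.shape[0] < 0.2]]
```

An equivalent, arguably more readable spelling is `df.loc[:, (df == -200).mean() < 0.2]`, since the mean of a boolean mask is exactly the fraction of True entries.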

Hope this helps!

molybdenum42