Drop % of rows that do not contain specific string

Question

I want to drop 20% of the rows that do not contain 'p' or 'u' in the label column. I know how to drop all of them, but I do not know how to drop only a certain percentage of them. This is my code:

import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "c", "d", "e", "f", "g", "h"],
                   "label": ["o-o-o", "o-o", "o-u", "o", "o-o-p-o", "o-o-o-o-o-o", "p-o-o", "o-o"]
})

print(df)

df = df[(df["label"].str.contains('p')) | (df["label"].str.contains('u'))]
print(df)

Remember that for str.contains with multiple 'or' you can write: df["label"].str.contains('p|u', regex=True) — Arkadiusz, Oct 21 '21 at 13:16
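
As a quick check of the comment above, here is a minimal sketch comparing the two spellings on the labels from the question. The variable names labels, two_calls and one_call are made up for this illustration.

```python
import pandas as pd

labels = pd.Series(["o-o-o", "o-o", "o-u", "o", "o-o-p-o",
                    "o-o-o-o-o-o", "p-o-o", "o-o"])

# two separate substring checks combined with a boolean OR
two_calls = labels.str.contains("p") | labels.str.contains("u")

# a single regular expression using alternation
one_call = labels.str.contains("p|u", regex=True)

# compare the two boolean Series element by element
print(one_call.equals(two_calls))
```

Since regex=True is the default for str.contains, the argument can also be left out.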

score 2 · Accepted Answer · answered Oct 21 '21 at 13:18

Use:

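import numpy as np

# make the index unique so rows can be dropped by label
df = df.reset_index(drop=True)

# mask of rows whose label contains neither 'p' nor 'u'
m = ~df["label"].str.contains('p|u')

# mark each of those rows for removal with probability 0.2
# (about 20% on average, not an exact count)
# https://stackoverflow.com/a/31794767/2901002
mask = np.random.choice([True, False], m.sum(), p=[0.2, 0.8])

# combine both masks and remove the selected rows
df = df.drop(df.index[m][mask])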
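
Because np.random.choice decides independently for each row, the number of rows removed varies from run to run. If an exact share is needed, one possible alternative (a sketch, not part of the original answer) is to sample the non-matching rows with DataFrame.sample. The names to_drop and result, and the seed passed to random_state, are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "label": ["o-o-o", "o-o", "o-u", "o", "o-o-p-o",
              "o-o-o-o-o-o", "p-o-o", "o-o"],
})

# mask of rows whose label contains neither 'p' nor 'u'
m = ~df["label"].str.contains("p|u")

# pick 20% of those rows (rounded to a whole number of rows);
# random_state is only fixed here to make the run repeatable
to_drop = df[m].sample(frac=0.2, random_state=0).index

# remove the sampled rows, keeping every row that contains 'p' or 'u'
result = df.drop(to_drop)
print(result)
```

With the sample data, 5 rows lack both 'p' and 'u', so one of them is removed and 7 rows remain.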

score 0 · Answer 2 · answered Oct 21 '21 at 13:22

You can update your code like this:

import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "c", "d", "e", "f", "g", "h"],
                   "label": ["o-o-o", "o-o", "o-u", "o", "o-o-p-o", "o-o-o-o-o-o", "p-o-o", "o-o"]
})

print(df)
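# index labels of rows whose label contains neither 'p' nor 'u'
drop_list = df[(~df["label"].str.contains('p')) & (~df["label"].str.contains('u'))].index.tolist()
# 20% of those rows, rounded down
number_of_drop = int(len(drop_list)*0.2)
# keep only the first 20% of the candidates and drop them
drop_list = drop_list[:number_of_drop]
df.drop(drop_list, inplace=True)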
print(df)
