Removing rows that do not start with/contain specific words

Question

I have the following output

Age
'1 year old',
'14 years old', 
'music store', 
'7 years old ',
'16 years old ',

created after using this line of code

df['Age']=df['Age'].str.split('.', expand=True,n=0)[0]
df['Age'].tolist()

I would like to remove rows from the dataset (preferably working on a copy, or creating a new dataset after filtering) that do not start with a number, or with a number followed by "year old" or "years old".

Expected output

Age (in a new dataset filtered)
'1 year old',
'14 years old', 
'7 years old ',
'16 years old ',

How could I do this?

use regex to filter: https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex — Roim, Jun 04 '20 at 18:48
`df['Age'].str.startswith()` is a good place to start, or `df['Age'].str.contains()` — G. Anderson, Jun 04 '20 at 18:49
Using `df['Age'] = [x for x in df['Age'] if not x.startswith('\d+')]` I got this AttributeError: 'bool' object has no attribute 'startswith' — , Jun 04 '20 at 18:58
You can't use regex with `startswith`; it only deals with the actual data, so to speak — sammywemmy, Jun 05 '20 at 00:06

score 1 · Accepted Answer · answered Jun 04 '20 at 19:03


Use Series.str.contains to create a boolean mask, then filter the dataframe with it:

m = df['Age'].str.contains(r'(?i)^\d+\syears?\sold')
df1 = df[m]

Result:

# print(df1)
             Age
0     1 year old
1   14 years old 
3    7 years old
4   16 years old

You can test the regex pattern here.
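
The AttributeError quoted in the comments on the question ('bool' object has no attribute 'startswith') suggests the column may also hold non-string values. The following is a minimal sketch, assuming a hypothetical sample frame in which `True` stands in for such a value, of how the mask could be made robust with `na=False` and how the result could be kept as an independent copy, as the question asks:

```python
import pandas as pd

# Hypothetical sample data; the True entry stands in for a non-string value.
df = pd.DataFrame({'Age': ['1 year old', '14 years old', 'music store',
                           '7 years old ', '16 years old ', True]})

# Non-string entries produce NaN in the mask; na=False treats them as non-matches.
m = df['Age'].str.contains(r'(?i)^\d+\syears?\sold', na=False)

# .copy() returns a new dataframe, so later edits do not touch the original.
df1 = df[m].copy()
print(df1)
```

Without `na=False`, the NaN entries in the mask would likely make the boolean indexing step fail, so passing it is a cheap safeguard when the column's contents are not fully known.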

answered Jun 04 '20 at 19:03

Shubham Sharma


Thank you @Shubham Sharma. May I ask how to include an OR condition in m? Is it OK to do the following: `df['Age'].str.contains(r'(?i)^\d+\syears | otherword')`? Thank you – Jun 04 '20 at 23:08

@Math yes, that would be fine, but in that case it matches strings like `10 year, 20 YEARS, 30 Years, otherword,...` – Shubham Sharma Jun 05 '20 at 05:17
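
One caveat with the OR pattern discussed in the comments above: spaces around `|` are literal in a regex, so `years | otherword` looks for "years" followed by a space, or "otherword" preceded by a space. A hedged sketch of one way to write the alternation, assuming both alternatives should be anchored to the start of the string (`otherword` is only the placeholder from the comment, and the sample series is hypothetical):

```python
import pandas as pd

# Hypothetical sample data; 'otherword' is the placeholder from the comment above.
s = pd.Series(['10 years', '20 YEARS old', 'otherword here', 'music store'])

# (?:...) groups the alternatives without creating a capture group, and the ^
# outside the group anchors both alternatives to the start of the string.
pattern = r'(?i)^(?:\d+\syears?|otherword)'
mask = s.str.contains(pattern)
print(s[mask].tolist())
```

The non-capturing group is a deliberate choice here: a plain `(...)` group would also work for filtering, but pandas warns when a pattern passed to `str.contains` has capture groups.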

sammywemmy · Answer 2 · 2020-06-05T00:18:36.873

0

The code below looks for text that starts with an apostrophe followed by a number, and keeps only those rows:

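import pandas as pd

# Read the sample rows from the clipboard
df = pd.read_clipboard(sep=';')

# str.match anchors at the start; the raw string keeps \d intact
df.loc[df.Age.str.match(r"'\d+")]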

            Age
0   '1 year old',
1   '14 years old',
3   '7 years old ',
4   '16 years old ',

Note that this only checks for an apostrophe followed by a number; @Shubham's solution covers a lot more.
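
To make the difference between the three string methods mentioned in this thread concrete, here is a small sketch on hypothetical data (values without the surrounding apostrophes). It compares `str.startswith`, which tests a literal prefix, with `str.match` and `str.contains`, which both take a regex:

```python
import pandas as pd

# Hypothetical sample data without the surrounding apostrophes.
s = pd.Series(['1 year old', 'music store', 'store 24', '7 years old '])

# Literal prefix test: no value starts with the two-character-class text "\d+".
literal = s.str.startswith(r'\d+')

# Regex anchored at the start of each string.
anchored = s.str.match(r'\d+')

# Regex that may match at any position in the string.
anywhere = s.str.contains(r'\d+')

print(literal.tolist())
print(anchored.tolist())
print(anywhere.tolist())
```

On this data `str.contains` also keeps 'store 24', which is why the accepted answer adds an explicit `^` to its pattern, while `str.match` does not need one.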

edited Jun 05 '20 at 00:18

answered Jun 05 '20 at 00:12

sammywemmy

