I have the following output

Age
'1 year old',
'14 years old', 
'music store', 
'7 years old ',
'16 years old ',

created after using this line of code

df['Age']=df['Age'].str.split('.', expand=True,n=0)[0]
df['Age'].tolist()

I would like to remove the rows that do not start with a number, or with a number followed by "year old" or "years old". Ideally the filtering would be done on a copy of the dataset, or would produce a new dataset.

Expected output

Age (in a new dataset filtered)
'1 year old',
'14 years old', 
'7 years old ',
'16 years old ',

How can I do this?

Shubham Sharma
  • use regex to filter: https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex – Roim Jun 04 '20 at 18:48
  • `df['Age'].str.startswith()` is a good place to start, or `df['Age'].str.contains()` – G. Anderson Jun 04 '20 at 18:49
  • Using `df['Age'] = [x for x in df['Age'] if not x.startswith('\d+')]` I got this AttributeError: 'bool' object has no attribute 'startswith' –  Jun 04 '20 at 18:58
  • You can't use regex with `startswith`; it only matches the literal text, so to speak – sammywemmy Jun 05 '20 at 00:06
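The AttributeError quoted in the comments suggests the column contains at least one non-string value (a bool), and `str.startswith` compares literal text rather than interpreting `\d+` as a pattern. A minimal sketch in plain Python, using made-up sample values (the `True` entry is an assumption standing in for whatever non-string value was in the real data):

```python
import re

# Hypothetical sample values, including a non-string entry like the one
# that would trigger "'bool' object has no attribute 'startswith'".
ages = ["1 year old", "14 years old", "music store", True, "7 years old "]

# str.startswith compares literal text: no value begins with the
# characters backslash, d, plus, so nothing is selected.
literal_hits = [x for x in ages if isinstance(x, str) and x.startswith(r"\d+")]

# re.match anchors at the start of the string and treats \d+ as "one or more digits".
regex_hits = [x for x in ages if isinstance(x, str) and re.match(r"\d+", x)]

print(literal_hits)  # []
print(regex_hits)    # ['1 year old', '14 years old', '7 years old ']
```

The `isinstance` guard is what avoids the AttributeError; the regex is what makes the digit test work.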

2 Answers

Use Series.str.contains to create a boolean mask, then filter the dataframe with it:

m = df['Age'].str.contains(r'(?i)^\d+\syears?\sold')
df1 = df[m]

Result:

# print(df1)
             Age
0     1 year old
1   14 years old 
3    7 years old
4   16 years old

You can test the regex pattern here.
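For a self-contained check, here is a sketch that rebuilds the question's sample and applies the same pattern. Two details are additions that the answer itself does not use: `na=False`, which makes missing entries count as non-matches instead of producing NaN in the mask, and `.copy()`, which returns an independent frame as the question asked. The `None` row is a hypothetical stand-in for a missing value.

```python
import pandas as pd

# Sample data reconstructed from the question; None stands in for a missing value.
df = pd.DataFrame({"Age": ["1 year old", "14 years old", "music store",
                           "7 years old ", "16 years old ", None]})

# (?i) ignores case, ^\d+ requires leading digits, years? accepts "year" or "years".
# na=False (an addition here) makes missing entries count as non-matches.
m = df["Age"].str.contains(r"(?i)^\d+\syears?\sold", na=False)

# .copy() gives an independent frame, so later edits to df1 leave df untouched.
df1 = df[m].copy()

print(df1["Age"].tolist())
# ['1 year old', '14 years old', '7 years old ', '16 years old ']
```

Without `na=False`, a mask containing NaN cannot be used for boolean indexing, so the guard matters whenever the column is not purely strings.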

Shubham Sharma
  • Thank you @Shubham Sharma. May I ask how to include an OR condition in m? Is it OK to do as follows: `df['Age'].str.contains(r'(?i)^\d+\syears | otherword')`? Thank you –  Jun 04 '20 at 23:08
  • @Math yes, that would be fine, but in that case it matches strings like `10 year, 20 YEARS, 30 Years, otherword,...` – Shubham Sharma Jun 05 '20 at 05:17
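One detail worth checking in the OR pattern from the comments: whitespace inside a regex is literal, so the spaces around `|` become part of each alternative. The sketch below compares the pattern as written with a variant that drops those spaces, on hypothetical values; whether the spaces were intended is an assumption.

```python
import pandas as pd

# Hypothetical values; "otherword" is the placeholder used in the comment.
s = pd.Series(["10 years", "20 YEARS old", "otherword", "music store"])

# With spaces around |, the alternatives are "^\d+\syears " and " otherword".
spaced = s.str.contains(r"(?i)^\d+\syears | otherword")

# Without the spaces, the alternatives are "^\d+\syears" and "otherword".
tight = s.str.contains(r"(?i)^\d+\syears|otherword")

print(spaced.tolist())  # [False, True, False, False]
print(tight.tolist())   # [True, True, True, False]
```

Note also that `^` applies only to the first alternative, so `otherword` can match anywhere in the string; wrap the alternatives as `^(?:...|...)` if both should be anchored at the start.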

The code below looks for text that starts with an apostrophe followed by a number, and keeps only those rows:

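import pandas as pd

df = pd.read_clipboard(sep=';')

# Keep rows whose Age starts with an apostrophe followed by digits.
# A raw string keeps \d from being read as an invalid string escape.
df.loc[df.Age.str.match(r"'\d+")]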

            Age
0   '1 year old',
1   '14 years old',
3   '7 years old ',
4   '16 years old ',

Note that this only checks for a leading apostrophe followed by a number; @Shubham's solution also checks the "year(s) old" part.

sammywemmy