In my data_cleaner dataset I have the column (feature) 'Project ID'. This identifies the project and it has a format 'code/YEAR/code'. I'm only interested in the project's year so I want to get rid of everything before the first / and everything after the second /.

Project ID  
AGPG/2013/1 
AGPG/2013/10
AGPG/2013/12
AGPG/2013/18
AGPG/2013/19

The closest I got was to strip what's before with

data_cleaner['Project ID'] = data_cleaner['Project ID'].str.strip("AGPG")

(but further down the column the prefix uses other letters, so this is not scalable)

And then I did

data_cleaner['Project ID'] = data_cleaner['Project ID'].str.strip('/')

This got rid of the first bit, but I can't manage to get rid of what comes after the year.

Project ID  
2013/1  
2013/10
2013/12
2013/18
2013/19

I read this post, but it didn't help me: Pandas DataFrame: remove unwanted parts from strings in a column

nahusznaj

1 Answer

I believe you need to split on '/' and select the second element of each resulting list:

data_cleaner['Project ID'] = data_cleaner['Project ID'].str.split('/').str[1]

Or extract with a regex, where /(\d{4})/ captures a four-digit number between two slashes:

data_cleaner['Project ID'] = data_cleaner['Project ID'].str.extract(r'/(\d{4})/', expand=False)

print(data_cleaner)
  Project ID
0       2013
1       2013
2       2013
3       2013
4       2013
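
For anyone who wants to try both approaches side by side, here is a minimal, self-contained sketch. The sample frame is an assumption: the 'XYZ/2014/3' row is made up to stand in for the "other letters" mentioned in the question, and the names by_split and by_regex are only illustrative.

```python
import pandas as pd

# Hypothetical sample data; the 'XYZ' row is invented to show a different prefix.
data_cleaner = pd.DataFrame(
    {"Project ID": ["AGPG/2013/1", "AGPG/2013/10", "XYZ/2014/3"]}
)

# Option 1: split on '/' and keep the middle piece (index 1 of each list).
by_split = data_cleaner["Project ID"].str.split("/").str[1]

# Option 2: capture exactly four digits enclosed by slashes.
by_regex = data_cleaner["Project ID"].str.extract(r"/(\d{4})/", expand=False)

print(by_split.tolist())
print(by_regex.tolist())
```

Both options give the year as a string, so add .astype(int) if you need numbers. Note also that str.strip("AGPG") does not remove a literal prefix: it removes any of the characters A, G and P from both ends of each string, which is another reason it does not generalise to other codes.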
jezrael