-1

I have a string which contains the number of processors:

SQLDB_GP_Gen5_2

The number I want (5) comes after _Gen and before the next _. How can I extract it using Python and regular expressions?

I am trying to do it like this but don't get a match:

re.match('_Gen(.*?)_', 'SQLDB_GP_Gen5_2')

I was also trying this using pandas:

x['SLO'].extract(pat = '(?<=_Gen).*?(?:(?!_).)')

But this also wasn't working. (x is a Series)

Can someone please also point me to a book or tutorial site where I can learn regex and how to use it with pandas?

Thanks,

Mick


3 Answers

2

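re.match only matches at the beginning of the string, so it returns None here because the string does not start with _Gen. Use re.search instead, which scans the whole string, and retrieve the first capturing group: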

>>> re.search(r'_Gen(\d+)_', 'SQLDB_GP_Gen5_2').group(1)
'5'
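A minimal sketch of the difference, assuming you also want to handle strings without a _Gen part. The helper name extract_generation and the second sample string are made up for illustration:

```python
import re

def extract_generation(slo):
    """Return the digits between '_Gen' and the next '_', or None if absent."""
    m = re.search(r'_Gen(\d+)_', slo)
    # re.search returns None when nothing matches, so guard before .group()
    return m.group(1) if m else None

# re.match anchors at position 0, so it finds nothing here
print(re.match(r'_Gen(\d+)_', 'SQLDB_GP_Gen5_2'))

# re.search scans the whole string
print(extract_generation('SQLDB_GP_Gen5_2'))

# hypothetical value with no _Gen part
print(extract_generation('SQLDB_Basic'))
```

The result is a string; wrap it in int() if you need a number.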
Rúben
2

You need to use Series.str.extract with a pattern containing a capturing group:

x['SLO'].str.extract(r'_Gen(.*?)_', expand=False)
        ^^^^           ^^^^^^^^^^^

To only match a number, use r'_Gen(\d+)_'.

NOTES:

  • Series.str.extract requires a capturing group: the method returns only what the group captures, not the whole match.
  • r'_Gen(.*?)_' matches _Gen, then captures zero or more characters other than line breaks, as few as possible, and then matches _. With \d+ in the group, it captures only one or more digits.
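The notes above can be sketched end to end as follows. The sample values are invented for illustration (only SQLDB_GP_Gen5_2 comes from the question), and the conversion with pd.to_numeric is an optional extra step, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample data; the last value has no _Gen part
slo = pd.Series(['SQLDB_GP_Gen5_2', 'SQLDB_BC_Gen4_8', 'SQLDB_Basic'])

# expand=False returns a Series (not a DataFrame) for a single capturing group
gen = slo.str.extract(r'_Gen(\d+)_', expand=False)

# Rows that do not match become NaN; the captured values are still strings
gen_numeric = pd.to_numeric(gen)

print(gen)
print(gen_numeric)
```

Because non-matching rows come back as NaN, the numeric result is a float column unless you fill or drop the missing values first.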
Wiktor Stribiżew
0

Using re.findall, where text is the input string:

re.findall(r'Gen(.*)_',text)[0]
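One caveat, shown as a small sketch: .* is greedy, so it runs up to the last underscore. That gives the expected result for the string in the question, but not if more underscores follow. The longer string below is a made-up example:

```python
import re

text = 'SQLDB_GP_Gen5_2'
longer = 'SQLDB_GP_Gen5_2_extra'  # hypothetical value with an extra underscore

# Greedy .* stops at the LAST underscore after Gen
greedy_short = re.findall(r'Gen(.*)_', text)
greedy_long = re.findall(r'Gen(.*)_', longer)

# Lazy .*? stops at the FIRST underscore after Gen
lazy_long = re.findall(r'Gen(.*?)_', longer)

print(greedy_short, greedy_long, lazy_long)
```

Note also that indexing with [0] raises IndexError when there is no match, since findall then returns an empty list.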
Vikas Periyadath