
I have a dataframe which looks like this:

description     
1906 RES 330 ML
1906 RES 330ML
RES 335 c/6
RES 332 c/12

I want to extract the run of exactly three consecutive digits from each description and save it in a new column 'volume'. My code is:

df['volume'] = df['description'].str.extract('([([\d]*[\d]){3,3}?])')

The expected result is:

volume
330
330
335
332

However, it gives the results like this:

volume
1906
1906
335
332

Can anyone help me fix this code? Thanks so much!!!

cs95
Elsa Li

3 Answers


Might be overkill, but if you want to make sure you don't capture digits that are part of a longer number (such as the 4-digit 1906), you might use this:

df['volume'] = df.description.str.extract(r'(?<!\d)(\d{3})(?!\d)', expand=False)    
print(df)

       description volume
0  1906 RES 330 ML    330
1   1906 RES 330ML    330
2      RES 335 c/6    335
3     RES 332 c/12    332

Specify expand=False so that the single capture group is returned as a pd.Series rather than as a one-column DataFrame.
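
To make the effect of expand concrete, here is a minimal sketch that runs the same pattern both ways on the sample data from the question. The variable names as_series and as_frame are only illustrative, and the DataFrame is rebuilt by hand from the question's table.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "description": [
        "1906 RES 330 ML",
        "1906 RES 330ML",
        "RES 335 c/6",
        "RES 332 c/12",
    ]
})

pattern = r"(?<!\d)(\d{3})(?!\d)"

# One capture group + expand=False -> a Series.
as_series = df["description"].str.extract(pattern, expand=False)

# Same pattern with expand=True -> a DataFrame with one column per group.
as_frame = df["description"].str.extract(pattern, expand=True)

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Note that the extracted values are strings; convert them afterwards if numeric volumes are needed.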


The regex:

  • (?<!\d) - negative lookbehind: the 3 digits must not be preceded by another digit
  • (\d{3}) - matches and captures exactly 3 digits
  • (?!\d) - negative lookahead: the 3 digits must not be followed by another digit
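
The three parts can also be checked without pandas. The following is a small sketch using only the standard re module; extract_volume is a hypothetical helper name introduced here for illustration.

```python
import re

# Lookbehind and lookahead reject digits on either side of the 3-digit run.
THREE_DIGITS = re.compile(r"(?<!\d)(\d{3})(?!\d)")

def extract_volume(text):
    """Return the first standalone 3-digit run in text, or None."""
    match = THREE_DIGITS.search(text)
    return match.group(1) if match else None

samples = [
    "1906 RES 330 ML",
    "1906 RES 330ML",
    "RES 335 c/6",
    "RES 332 c/12",
]

volumes = [extract_volume(s) for s in samples]
print(volumes)  # ['330', '330', '335', '332']
```

A string that contains only a longer number, such as "1906", yields None because every 3-digit window inside it touches another digit.
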
cs95

You need to:

  • not match "any number of digits" three times, so delete the [\d]*
  • not match 3 digits inside anything that looks like a "word",
    especially not inside other digits, so use the word boundary \b
  • not add the ? after the quantifier; it is unnecessary here
  • not overuse character classes []; \d works on its own

You do not need to:

  • use two capture groups ()

This regex will find exactly three digits standing alone as a whole word:

\b(\d{3})\b
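
One caveat worth checking against the question's data: \b treats letters as word characters too, so there is no boundary between 330 and ML in "330ML". The sketch below compares the word-boundary pattern with a lookaround-based pattern using the standard re module; first_group is a hypothetical helper introduced only for this comparison.

```python
import re

samples = [
    "1906 RES 330 ML",
    "1906 RES 330ML",   # digits glued to letters: no \b after "330"
    "RES 335 c/6",
    "RES 332 c/12",
]

word_boundary = re.compile(r"\b(\d{3})\b")
lookaround = re.compile(r"(?<!\d)(\d{3})(?!\d)")

def first_group(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

with_boundary = [first_group(word_boundary, s) for s in samples]
with_lookaround = [first_group(lookaround, s) for s in samples]

print(with_boundary)    # ['330', None, '335', '332']
print(with_lookaround)  # ['330', '330', '335', '332']
```

If descriptions like "330ML" need to match, the digit-only lookarounds are the safer choice; \b is fine when the number is always separated by spaces or punctuation.
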
Yunnosch

The regex you are looking for is \b[\d]{3}\b

For more information on \b, see the docs.

yugantar