How to extract a number of a certain length from a string in Python?

Question

I have a dataframe which looks like this:

description     
1906 RES 330 ML
1906 RES 330ML
RES 335 c/6
RES 332 c/12

I want to extract the run of exactly three consecutive digits from each description and save it in a new column, 'volume'. My code looks like this:

df['volume'] = df['description'].str.extract('([([\d]*[\d]){3,3}?])')

The expected result is:

volume
330
330
335
332

However, it gives the results like this:

volume
1906
1906
335
332

Can anyone help me fix this code? Thanks so much!!!

cs95 · Accepted Answer · 2017-08-28T19:11:12.963

5

Might be overkill, but if you want to make sure you don't capture numbers that are part of 4 digit numbers, you might use this:

df['volume'] = df.description.str.extract(r'(?<!\d)(\d{3})(?!\d)', expand=False)    
print(df)

       description volume
0  1906 RES 330 ML    330
1   1906 RES 330ML    330
2      RES 335 c/6    335
3     RES 332 c/12    332

Specify expand=False so that the matches are returned as a single pd.Series rather than a one-column DataFrame.
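A minimal sketch of what expand changes, assuming the sample data from the question is rebuilt by hand (the names pat, as_series and as_frame are only illustrative):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "description": [
        "1906 RES 330 ML",
        "1906 RES 330ML",
        "RES 335 c/6",
        "RES 332 c/12",
    ]
})

pat = r'(?<!\d)(\d{3})(?!\d)'

# One capture group + expand=False -> a Series, ready to assign to a column.
as_series = df["description"].str.extract(pat, expand=False)

# expand=True -> a DataFrame with one column per capture group.
as_frame = df["description"].str.extract(pat, expand=True)

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Either form can be assigned to df['volume'] when there is a single capture group; the Series form just makes the intent explicit.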

The regex:

(?<!\d) - negative lookbehind: the character immediately before the 3 digits must not be a digit
(\d{3}) - matches exactly 3 digits and captures them
(?!\d) - negative lookahead: the character immediately after the 3 digits must not be a digit
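To see the lookarounds at work outside pandas, here is a small sketch using only the standard re module on the strings from the question (the variable names are illustrative):

```python
import re

# Exactly three digits with no digit directly before or after them.
pattern = re.compile(r'(?<!\d)(\d{3})(?!\d)')

descriptions = [
    "1906 RES 330 ML",
    "1906 RES 330ML",
    "RES 335 c/6",
    "RES 332 c/12",
]

volumes = []
for text in descriptions:
    match = pattern.search(text)
    # "1906" is skipped: every 3-digit slice of it touches another digit.
    volumes.append(match.group(1) if match else None)

print(volumes)  # ['330', '330', '335', '332']
```

Note that the captured values are strings; convert them with int() (or astype(int) in pandas) if numeric volumes are needed.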

edited Aug 28 '17 at 19:11

answered Aug 28 '17 at 18:22


Maybe `r'(?<!\d)(\d{3,3})(?!\d)'` – Wiktor Stribiżew Aug 28 '17 at 18:24
why `\d{3,3}` why not just `\d{3}` ? – JBone Aug 28 '17 at 19:10
@JBone because I'm still relatively inexperienced with regex. Thanks for the correction. I'll add that in. – cs95 Aug 28 '17 at 19:11

Yunnosch · Answer 2 · 2017-08-28T18:39:15.493

2

You need to:

stop matching "any number of digits" three times, so delete the [\d]*
avoid matching 3 digits inside anything that looks like a "word", especially other digits, so use the word boundary \b
remove the ? (after a quantifier it only makes the match lazy, which is not needed here)
not overdo the character set brackets []

You do not need to:

use two capture groups ()

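This regex will find exactly three digits standing alone as a "word" (note that it will not match the 330 in 330ML, because there is no word boundary between a digit and a letter):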

\b(\d{3})\b
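The two approaches in this thread differ on one of the question's rows. A short sketch with the standard re module, comparing the word-boundary pattern with the lookaround pattern from the accepted answer (first_group is a hypothetical helper written for this comparison):

```python
import re

word_boundary = re.compile(r'\b(\d{3})\b')
lookaround = re.compile(r'(?<!\d)(\d{3})(?!\d)')

def first_group(pattern, text):
    """Return the first captured group, or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None

for text in ["1906 RES 330 ML", "1906 RES 330ML", "RES 335 c/6"]:
    print(text, "|", first_group(word_boundary, text),
          "|", first_group(lookaround, text))

# \b sits between a word character (letter, digit, underscore) and a
# non-word character, so there is no boundary between "0" and "M" in "330ML".
```

For data like "330ML", where the unit is glued to the number, the lookaround version is the one that returns 330; the \b version finds no match for that row.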

edited Aug 28 '17 at 18:39

answered Aug 28 '17 at 18:32


score 0 · Answer 3 · answered Aug 28 '17 at 20:16

The regex you are looking for is \b[\d]{3}\b

For more information on \b, see the docs.

answered Aug 28 '17 at 20:16

yugantar

