1

Here is an example of my input sentences. I want to extract the numbers that are followed by mm or cm. Here is the regular expression I have tried.

 sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size' 

 re.findall(r'(\d+) cm',sen)

This gives the output as

 ['0']

Then I just tried to extract numbers without conditions as

 print(re.findall(r'\d+', sen))

This gives the output as

 ['1', '9', '1', '4', '2', '0']

My expected output is

 ['1.9x1.4x2.0'] or ['1.9', '1.4', '2.0']

This is not a duplicate, because I am also looking for a way to handle the cm and mm units together with floating-point numbers.

khushbu

5 Answers

3

You could use 3 capturing groups to get the digits, and make sure that the measurement ends with cm or mm by using a character class.

(?<!\S)(\d+\.\d+)x(\d+\.\d+)x(\d+\.\d+) [cm]m(?!\S)

In parts

  • (?<!\S) Negative lookbehind, assert what is directly on the left is not a non whitespace char
  • (\d+\.\d+)x Capture group 1, match 1+ digits and a decimal part, then match x
  • (\d+\.\d+)x Capture group 2 Same as above
  • (\d+\.\d+) Capture group 3 Match 1+ digits and a decimal part
  • [cm]m Match cm or mm
  • (?!\S) Negative lookahead, assert what is directly on the right is not a non whitespace char

Regex demo | Python demo

For example

import re

regex = r"(?<!\S)(\d+\.\d+)x(\d+\.\d+)x(\d+\.\d+) [cm]m(?!\S)"
test_str = "The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size"

print(re.findall(regex, test_str))

Output

[('1.9', '1.4', '2.0')]

To get the output including the x you could use

(?<!\S)(\d+\.\d+x\d+\.\d+x\d+\.\d+) [cm]m(?!\S)

Regex demo | Python demo

Output

['1.9x1.4x2.0']

Edit

To match only the values and allow 1 or more spaces or tabs between the digits and the unit, you could use a positive lookahead:

\d+(?:\.\d+)?(?:(?:x\d+(?:\.\d+)?)*)?(?=[ \t]+[cm]m)
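A minimal sketch of how the lookahead pattern could be used with re.findall. Because every group in it is non-capturing, findall returns the matched values as plain strings. The example sentences and the names `pattern`, `sentences` and `results` are made up for illustration.

```python
import re

# Pattern from the edit above: match only the values, and assert that they are
# followed by one or more spaces/tabs and then "cm" or "mm".
pattern = re.compile(r"\d+(?:\.\d+)?(?:(?:x\d+(?:\.\d+)?)*)?(?=[ \t]+[cm]m)")

# Illustrative sentences (assumed, not real data).
sentences = [
    "measured 1.9x1.4x2.0 cm in size",
    "lesions measured 4.0 mm and 3   mm in size, respectively.",
    "seen in 2019 on 3 slices",  # no unit after the numbers
]

results = [pattern.findall(s) for s in sentences]
for r in results:
    print(r)
# ['1.9x1.4x2.0']
# ['4.0', '3']
# []
```

Note that the lookahead does not check what follows the unit, so a value in front of a word such as "mmol" would also match; appending `\b` inside the lookahead is one possible way to tighten that.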

Regex

The fourth bird
  • I am wondering why this doesn't work for a sentence like "Other two ill defined, small ground glass lesions measured 4.0 mm and 3 mm in size, respectively." Could you please help? – khushbu Sep 12 '19 at 07:13
  • @pari If you also want to match that format you could make the first part which could contain `x` optional. `(?<!\S)((?:\d+\.\d+x\d+\.\d+x)?\d+(?:\.\d+)?) [cm]m(?!\S)` See https://regex101.com/r/AeM6Gp/1 – The fourth bird Sep 12 '19 at 09:48
  • So do you suggest that I remove the "x" in a second step, or can I write something that, in one step, both extracts the numbers ending with either mm or cm and removes the 'x' as well? – khushbu Sep 12 '19 at 10:10
  • @pari That depends on what you want to match. You could make all parts optional like this https://regex101.com/r/CDBSWB/1 or you could match either the whole part with the `x` or a separate part https://regex101.com/r/wH13ns/1 – The fourth bird Sep 12 '19 at 10:15
  • Please excuse my ignorance, I am trying to learn to write regex. `df1['numbers3'] = [[float(l) or l for l in x] for x in df1['TEXT'].str.findall(r'\d+\.\d+')]` This way I am extracting numbers from the text and converting them into floats for further calculations. Based on what you suggested I wrote `df1['numbers2'] = [[float(l) or l for l in x] for x in df1['TEXT'].str.findall(r'(?<!\S)(?:(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)x)?(\d+(?:\.\d+)?) [cm]m(?!\S)')]` and this gives me the error "TypeError: float() argument must be a string or a number, not 'tuple'" – khushbu Sep 13 '19 at 10:20
  • @pari That is what [re.findall](https://docs.python.org/3/library/re.html#re.findall) returns `If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group` – The fourth bird Sep 13 '19 at 10:23
  • I want to extract numbers ending with 'mm' or 'cm' only and not random numbers. Further calculations would be like if it is mm, I would convert it into cm. – khushbu Sep 13 '19 at 10:24
  • @pari If you want a single capturing group, you could use `(\d+(?:\.\d+)?(?:(?:x\d+(?:\.\d+)?)*)?) [cm]m` https://regex101.com/r/CzdRda/1 and use the capturing group in the replacement. The `mm` or `cm` that is matched you can replace with another value. – The fourth bird Sep 13 '19 at 11:06
  • This regex will not work when there is more than one space between the number and mm or cm. How can I fix this regex to work even if there are more spaces but it only has to check if mm or cm is there after the number since my data does contain this heterogeneity? – khushbu Sep 15 '19 at 11:13
  • @pari You might use a character class to match 1+ spaces or tabs. To get only the values you could match them without using a group and use a positive lookahead to assert the cm or mm `\d+(?:\.\d+)?(?:(?:x\d+(?:\.\d+)?)*)?(?=[ \t]+[cm]m)` https://regex101.com/r/8NnW9D/1 – The fourth bird Sep 15 '19 at 17:13
  • OK, this works. Can you edit your answer to include this, so I can accept it as the answer? Also, can we have a private chat? I have more to discuss. – khushbu Sep 15 '19 at 17:47
  • @pari I have added an update to the answer with this pattern. If you have another question, my advice would be to create a new question. – The fourth bird Sep 15 '19 at 17:52
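
The comments above raise two follow-ups: re.findall returns tuples when the pattern has more than one group, and mm values should end up in cm. The following is only a sketch of one way to handle both, under the assumption that a flat list of floats per sentence is wanted. The names `MEASUREMENT` and `sizes_in_cm` are hypothetical and do not come from the thread.

```python
import re

# Hypothetical pattern: group 1 is the value part (optionally joined by "x"),
# group 2 is the unit. Spaces or tabs are allowed between them.
MEASUREMENT = re.compile(r"(\d+(?:\.\d+)?(?:x\d+(?:\.\d+)?)*)[ \t]+([cm]m)\b")

def sizes_in_cm(text):
    """Return a flat list of floats, in cm, for each value followed by cm or mm."""
    sizes = []
    # With two groups, findall yields (values, unit) tuples, so unpack them
    # instead of passing the tuple to float().
    for values, unit in MEASUREMENT.findall(text):
        divisor = 10.0 if unit == "mm" else 1.0
        sizes.extend(float(v) / divisor for v in values.split("x"))
    return sizes

print(sizes_in_cm("measured 1.9x1.4x2.0 cm in size"))           # [1.9, 1.4, 2.0]
print(sizes_in_cm("lesions measured 4.0 mm and 3 mm in size"))  # [0.4, 0.3]
```

If the text lives in a pandas column such as the `df1['TEXT']` mentioned in the comments, something like `df1['TEXT'].apply(sizes_in_cm)` could apply it per row, assuming every entry is a string.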
0

You can use a lookahead with re.findall:

import re
sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size' 
result = re.findall(r'[\dx\.]+(?=\scm)', sen)

Output:

['1.9x1.4x2.0']
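
As a possible variation (an assumption on my part, not something this answer states), the lookahead can be widened to accept mm as well as cm and any run of whitespace before the unit, and each match can then be split on x to get the separate values the question also lists as acceptable output. The names `matches` and `parts` are illustrative.

```python
import re

sen = 'measured 1.9x1.4x2.0 cm in size'

# Same character class as above, but the lookahead accepts "cm" or "mm"
# after one or more whitespace characters.
matches = re.findall(r'[\dx.]+(?=\s+[cm]m\b)', sen)

# Split each match on "x" to get the individual numbers as strings.
parts = [m.split('x') for m in matches]

print(matches)  # ['1.9x1.4x2.0']
print(parts)    # [['1.9', '1.4', '2.0']]
```

Because the character class accepts any mix of digits, x and dots, it does not validate that the match is a well-formed number.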
Ajax1234
0

Try this:

import re
sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size' 
re.findall(r'\d+\.\d+', sen)

Output:

['1.9', '1.4', '2.0']
Arkistarvh Kltzuonstev
0

Here's another approach:

import re
sen = 'The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size' 
# Escape the dot so it matches a literal decimal point, not any character
output = re.findall(r'\d\.\d', sen)

Output:

['1.9', '1.4', '2.0']
Nouman
0
import re
sen = '''The study reveals a speculated nodule with pleural tagging at anterior basal 
segment of LLL, measured 1.9x1.4x2.0 cm in size'''

print(re.findall(r'[\d.]+', sen))

Output

['1.9', '1.4', '2.0']
Divyesh patel