I'm trying to extract contextual information from freeform text files using regex, but the regex I'm using isn't working as expected.
Here are some examples (taken from large bodies of text) of what I'd like to process:
- 2 cans of beer per week
- 8-10 beers 1x/week
- reports that he drinks 1-2x/week and drinks 12-pack of beers during that time
- ~6 pack of beer throughout the week, 1-3 beers in one sitting
- 2-4 7oz glasses of wine 4 x week
- daily drinker (1 pack a day)
I am trying to extract the different portions (a quantity followed by a container size or beverage) with:
import re

sizes = {  # drink volume in oz
    'oz': 1,
    '[^a-z]cans?[^a-z]': 12,
    '[^a-z]glass': 5,
    'bottle': 25,
    'shot': 1.5,
    'pint': 16,
    'fifth': 25,
    'large can': 22,
    '6.{,3}pack': 72,
    '12.{,3}pack': 144,
}
szre = r'|'.join('(%s)' % sz for sz in sizes)
''.join(re.findall('\d+((\-|/|\.)\d+)?.{,3}((%s)|(%s))' % (szre, sbre), line)[0])
where sbre is a similar group expression. So far, running the code gives me
(Pdb++) line1 = '2 cans of beer per week'
(Pdb++) ''.join(re.findall('\d+((\-|/|\.)\d+)?.{,3}((%s)|(%s))' % (szre, sbre), line1)[0])
' cans cans cans '
instead of the expected
'2 cans '
How can I get this to return the full match? BTW, I've already noticed that when the pattern contains capturing groups, re.findall() returns the groups rather than the whole match, hence the workaround with ''.join().
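For reference, the findall() behaviour I mean can be reproduced with a much smaller pattern. This is only a sketch: the patterns and the input string below are made up for illustration and are not from my real code.

```python
import re

text = '2 cans of beer'  # toy input, not real data

# No capturing groups: findall() returns the whole match.
no_groups = re.findall(r'\d+ cans?', text)

# One capturing group: findall() returns only that group's text.
one_group = re.findall(r'\d+ (cans?)', text)

# Several capturing groups: findall() returns one tuple per match,
# with '' for any group that did not take part in the match.
many_groups = re.findall(r'(\d+)(-\d+)? (cans?)', text)

print(no_groups)    # ['2 cans']
print(one_group)    # ['cans']
print(many_groups)  # [('2', '', 'cans')]
```

Joining such a tuple with ''.join() concatenates the groups, not the matched text, which would explain why nested groups show up more than once in my output.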
Thanks.
EDIT: Link to source: https://github.com/skeledrew/medical-nlp-research/blob/master/add_feats.py Actual data cannot be provided of course due to PHI sensitivity.
EDIT: MCVE as requested
>>> import re
>>> szre = '([^a-z]glass)|(pint)|([^a-z]cans?[^a-z])|(fifth)|(bottle)|(6.{,3}pack)|(12.{,3}pack)|(large can)|(shot)'
>>> sbre = '(wine)|(drink)|(liquor)|(milwaukee best ice)'
>>> line1 = '2 cans of beer per week'
>>> ''.join(re.findall(r'\d+(?:[-/.]\d+)? {0,3}%s%s' % (szre, sbre), line1)[0])
' cans ' # still wrong
>>> re.search('\d+((\-|/|\.)\d+)?.{,3}((%s)|(%s))' % (szre, sbre), line1)
<_sre.SRE_Match object; span=(0, 7), match='2 cans '> # correct span/match
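Since re.search() reports the right span, one possible way around the findall() behaviour might be to iterate with re.finditer() and read group(0), which is always the entire match no matter how many groups the pattern has. The sketch below reuses the szre, sbre and line1 values from the MCVE; the names pattern and whole_matches are just illustrative, and I have only tried it on this one line.

```python
import re

# Same expressions as in the MCVE above, split across lines for readability.
szre = ('([^a-z]glass)|(pint)|([^a-z]cans?[^a-z])|(fifth)|(bottle)'
        '|(6.{,3}pack)|(12.{,3}pack)|(large can)|(shot)')
sbre = '(wine)|(drink)|(liquor)|(milwaukee best ice)'
line1 = '2 cans of beer per week'

pattern = re.compile(r'\d+((\-|/|\.)\d+)?.{,3}((%s)|(%s))' % (szre, sbre))

# group(0) is the whole match, independent of the capturing groups.
whole_matches = [m.group(0) for m in pattern.finditer(line1)]
print(whole_matches)  # ['2 cans ']
```

The same list should come out of findall() if every group is turned into a non-capturing (?:...) group, but that means rebuilding szre and sbre.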