I'm trying to extract contextual information from freeform text files using regex, but the regex I'm using isn't working as expected.

These are some of the examples (from large bodies of text) that I'd like to process:

  • 2 cans of beer per week
  • 8-10 beers 1x/week
  • reports that he drinks 1-2x/week and drinks 12-pack of beers during that time
  • ~6 pack of beer throughout the week, 1-3 beers in one sitting
  • 2-4 7oz glasses of wine 4 x week
  • daily drinker (1 pack a day)

I am trying to extract different portions with:

import re

sizes = {  # drink volume in oz
    'oz': 1,
    '[^a-z]cans?[^a-z]': 12,
    '[^a-z]glass': 5,
    'bottle': 25,
    'shot': 1.5,
    'pint': 16,
    'fifth': 25,
    'large can': 22,
    '6.{,3}pack': 72,
    '12.{,3}pack': 144,
}
# Wrap each size pattern in its own capturing group and join them with '|'.
szre = '|'.join('(%s)' % sz for sz in sizes)
''.join(re.findall(r'\d+((\-|/|\.)\d+)?.{,3}((%s)|(%s))' % (szre, sbre), line)[0])

Here sbre is a similar alternation of groups, built the same way as szre, and line is one line of input text. So far, running the code gives me

(Pdb++) line1 = '2 cans of beer per week'
(Pdb++) ''.join(re.findall('\d+((\-|/|\.)\d+)?.{,3}((%s)|(%s))' % (szre, sbre), line1)[0])
' cans  cans  cans '

instead of an expected

'2 cans '

How can I get this to return the full match, digits included? BTW, I've already noticed that re.findall() returns the captured groups rather than the whole match when the pattern contains groups, hence the workaround with ''.join().
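To make the findall behavior concrete, here is a minimal sketch using a cut-down pattern (only `cans?` and `pint`, chosen for illustration and not the full expression from the question). It compares a pattern that has capturing groups with the same pattern written with non-capturing groups only; the names `with_groups` and `without_groups` are hypothetical.

```python
import re

line = '2 cans of beer per week'

# With capturing groups, findall returns one tuple of group texts per match,
# not the full matched text.
with_groups = re.findall(r'(\d+)(?:[-/.]\d+)?.{0,3}((cans?)|(pint))', line)

# With only non-capturing groups, findall returns the full matched text.
without_groups = re.findall(r'\d+(?:[-/.]\d+)?.{0,3}(?:cans?|pint)', line)

print(with_groups)     # list of tuples, one entry per capturing group
print(without_groups)  # list of whole-match strings
```

Joining the tuple from the first call repeats the unit once per enclosing group, which would be consistent with the ' cans  cans  cans ' result above.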

Thanks.

EDIT: Link to source: https://github.com/skeledrew/medical-nlp-research/blob/master/add_feats.py Actual data cannot be provided of course due to PHI sensitivity.

EDIT: MCVE as requested

>>> import re
>>> szre = '([^a-z]glass)|(pint)|([^a-z]cans?[^a-z])|(fifth)|(bottle)|(6.{,3}pack)|(12.{,3}pack)|(large can)|(shot)'
>>> sbre = '(wine)|(drink)|(liquor)|(milwaukee best ice)'
>>> line1 = '2 cans of beer per week'
>>> ''.join(re.findall(r'\d+(?:[-/.]\d+)? {0,3}%s%s' % (szre, sbre), line1)[0])
' cans '  # still wrong
>>> re.search('\d+((\-|/|\.)\d+)?.{,3}((%s)|(%s))' % (szre, sbre), line1)
<_sre.SRE_Match object; span=(0, 7), match='2 cans '>  # correct span/match
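One likely reason the digit goes missing in the `%s%s` version: the `|` alternatives land at the top level of the pattern, so the numeric prefix belongs only to the first alternative and `[^a-z]cans?[^a-z]` is free to match on its own. The sketch below is one possible way around that, not a definitive fix. It assumes the terms from the MCVE, drops the per-term capturing parentheses, wraps the whole alternation in a single non-capturing group, and reads the full match from `re.finditer`. The names `size_terms`, `substance_terms`, and `extract_portions` are hypothetical.

```python
import re

# Terms copied from the MCVE, without the per-term capturing parentheses.
size_terms = ['[^a-z]glass', 'pint', '[^a-z]cans?[^a-z]', 'fifth', 'bottle',
              '6.{0,3}pack', '12.{0,3}pack', 'large can', 'shot']
substance_terms = ['wine', 'drink', 'liquor', 'milwaukee best ice']

# One non-capturing group around the whole alternation keeps the numeric
# prefix attached to every alternative, not just the first one.
alternatives = '|'.join(size_terms + substance_terms)
pattern = re.compile(r'\d+(?:[-/.]\d+)?.{0,3}(?:%s)' % alternatives)


def extract_portions(line):
    """Return the full text of every quantity match found in line."""
    return [m.group(0) for m in pattern.finditer(line)]


print(extract_portions('2 cans of beer per week'))
```

Using `m.group(0)` sidesteps the question of what findall returns, so capturing groups could be reintroduced later (for example, to pull out the number separately) without changing how the full match is read.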
skeledrew
  • No language processing can be generalized with regex. It can't, nor will it ever happen. – sln Aug 15 '17 at 16:04
  • Replace all `(` with `(?:` – Wiktor Stribiżew Aug 15 '17 at 16:08
  • @WiktorStribiżew I'm not seeing how my post is a duplicate of the other. I was just acknowledging that I noticed the behavior with re.findall, not basing my post on it. My question is why it gets only 'can' when there is a number-specific portion in the expression? – skeledrew Aug 15 '17 at 16:24
  • @skeledrew Post the relevant code so that the issue can be reproduced. Right now, it does look like you just used capturing groups instead of non-capturing ones, and that is what the linked post is all about. Did you actually try what I suggested above? – Wiktor Stribiżew Aug 15 '17 at 16:26
  • @sln I don't mean for this to be a comprehensive solution, just a small part of a larger processing task. – skeledrew Aug 15 '17 at 16:27
  • @WiktorStribiżew I'm still fairly new to using REs, and am not quite sure of the difference between capturing and non-capturing groups. However I will push and link my code. – skeledrew Aug 15 '17 at 16:32
  • Did you try `'\d+(?:(?:\-|/|\.)\d+)?.{0,3}(?:(?:%s)|(?:%s))'`? Although now, I see that this `(?:(?:%s)|(?:%s))` makes little sense and can be written as `%s`. And you should not omit `0` in the limiting quantifier, it must be `{0,3}`, not `{,3}`. Also, `(?:\-|/|\.)` is better written as `[-/.]`. Ok, try `r'\d+(?:[-/.]\d+)?.{0,3}%s'` – Wiktor Stribiżew Aug 15 '17 at 16:34
  • Thanks @WiktorStribiżew, that reduced the example result to ' cans '. But it's still missing the digit. I may end up just doing this with re.search instead, though it'll make the code more complex. Also needs %s%s to get both inclusion args. – skeledrew Aug 15 '17 at 16:52
  • Please provide [MCVE (minimal complete verifiable example)](http://stackoverflow.com/help/mcve). – Wiktor Stribiżew Aug 15 '17 at 16:58

0 Answers