I have this regex:

cont_we_re = r"((?!\S+\s?(?:(cbf|cd3|cbm|m3|m[\\\>\?et]?|f3|ft3)))(?:([0-9,\.]+){2,})(?:\s*(?:(lb|kg)\.?s?))?)"

The current logic is: match any numeric chunk, optionally followed by kg(s) or lb(s), but do not match it if cbf, cd3, cbm, m3, etc. follows the numeric chunk. It works perfectly for these sample cases:

s1 = "18300 kg 40344.6 lbs 25000 m3"
s2 = "18300kg 40344.6lbs 25000m3"
s3 = "18300 kg   KO"
s4 = "40344.6 lb5   "
s5 = "40344.6  "

I'm using re.finditer() with the re.IGNORECASE flag, like this:

for s in [s1, s2, s3, s4, s5]:
    all_val = [i.group().strip() for i in re.finditer(cont_we_re, s, re.IGNORECASE)]
    print(all_val)

This gives me the following output:

['18300 kg', '40344.6 lbs']
['18300kg', '40344.6lbs']
['18300 kg']
['40344.6 lb']
['40344.6']

Now I'm trying to implement another rule: if a numeric chunk followed by lbs is found, match it with first priority and return only that match; if no lbs is found, take the numeric chunks that stand alone or are followed by kgs.

I've done this without changing the regex, like this:

for s in [s1, s2, s3, s4, s5]:
    all_val = [i.group().strip() for i in re.finditer(cont_we_re, s, re.IGNORECASE)]
    kg_val = [i for i in all_val if re.findall(r"kg\.?s?", i)]
    lb_val = [i for i in all_val if re.findall(r"lb\.?s?", i)]
    final_val = lb_val if lb_val else (kg_val if kg_val else list(set(all_val) - (set(kg_val+lb_val))))
    print(final_val)

This gives me the desired output:

['40344.6 lbs']
['40344.6lbs']
['18300 kg']
['40344.6 lb']
['40344.6']
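
For reference, the post-filtering step above can be written as a small standalone function. This is only a sketch: the name pick_by_priority is hypothetical (it is not in the original code), and unlike the snippet above it compares units case-insensitively and keeps the original order of the matches.

```python
import re

# Assumption: units are matched case-insensitively, which the original
# post-filter does not do.
LB_RE = re.compile(r"lb\.?s?", re.IGNORECASE)
KG_RE = re.compile(r"kg\.?s?", re.IGNORECASE)

def pick_by_priority(matches):
    """Hypothetical helper: lb matches win, then kg matches, then everything."""
    lb_val = [m for m in matches if LB_RE.search(m)]
    if lb_val:
        return lb_val
    kg_val = [m for m in matches if KG_RE.search(m)]
    if kg_val:
        return kg_val
    # No unit found anywhere: return the bare numeric chunks unchanged.
    return list(matches)

print(pick_by_priority(["18300 kg", "40344.6 lbs"]))
print(pick_by_priority(["18300 kg"]))
print(pick_by_priority(["40344.6"]))
```

The function takes the list produced by the finditer() comprehension, so it can replace the three filtering lines in the loop.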

The question is: how can I apply this same logic in the regex itself, without searching for kgs and lbs separately in each match of cont_we_re for each string? I tried an "IF-THEN-ELSE" type regex as portrayed in this question, but it doesn't work: the first part of the regex, (?, yields a pattern error in Python. Is there any way to do this with only the cont_we_re regex?
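
As a minimal illustration of that pattern error, the sketch below checks which conditional forms the standard-library re module accepts. The helper name compiles_with_stdlib and both test patterns are made up for this example; the point is that re only supports conditionals that test a capture group, such as (?(1)yes|no), not conditionals that test a lookahead.

```python
import re

def compiles_with_stdlib(pattern):
    """Hypothetical helper: True if stdlib re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Conditional on a lookahead, (?(?=...)yes|no): rejected by stdlib re.
lookahead_conditional = r"(?(?=\d)\d+|[a-z]+)"
# Conditional on capture group 1, (?(1)yes|no): accepted by stdlib re.
group_conditional = r"(\d+)?(?(1)kg|lb)"

print(compiles_with_stdlib(lookahead_conditional))
print(compiles_with_stdlib(group_conditional))
```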

Wiktor Stribiżew
Arkistarvh Kltzuonstev

2 Answers

A possible solution is to use the regex module from PyPI, which supports (*SKIP)(*FAIL) and variable-length lookbehind, and to use lookarounds to check for the presence of lb:

(?:\d+(?:\.\d+)? ?lbs?|(?<!lb.*)(?!.*lb)\d+(?:\.\d+)?(?: kg)?|\d+(?:\.\d+)? ?kg(*SKIP)(*FAIL))
  • (?: Non capturing group
    • \d+(?:\.\d+)? ?lbs? Match the number format with an optional decimal part followed by lb and optional s
    • | Or
    • (?<!lb.*)(?!.*lb)\d+(?:\.\d+)?(?: kg)? Assert that the string contains no lb before or after the current position, then match the number format with an optional decimal part, optionally followed by a space and kg
    • | Or
    • \d+(?:\.\d+)? ?kg(*SKIP)(*FAIL) Match the number format followed by kg and skip that match
  • ) Close non capturing group
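
To see why the second alternative needs both lookarounds, here is a small sketch using only the standard-library re module. With the lookahead (?!.*lb) alone, numbers that come after the lb match still slip through; the lookbehind (?<!lb.*) is what rules them out, and it is variable-length, which is why the regex module is required. The function numbers_unless_lb is a hypothetical stdlib stand-in that tests the whole string for lb up front; it is not part of the answer's pattern.

```python
import re

s1 = "18300 kg 40344.6 lbs 25000 m3"
s3 = "18300 kg   KO"

# Lookahead only: everything before "lb" is rejected, but numbers after it
# are not, so the volume figure and the 3 of "m3" are still returned.
lookahead_only = re.findall(r"(?!.*lb)\d+(?:\.\d+)?", s1)
print(lookahead_only)

def numbers_unless_lb(s):
    """Hypothetical stand-in: return bare/kg numbers only if no lb occurs."""
    if re.search(r"lb", s, re.IGNORECASE):
        return []
    return re.findall(r"\d+(?:\.\d+)?(?: kg)?", s, re.IGNORECASE)

print(numbers_unless_lb(s1))
print(numbers_unless_lb(s3))
```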

For example

import regex

s1 = "18300 kg 40344.6 lbs 25000 m3"
s2 = "18300kg 40344.6lbs 25000m3"
s3 = "18300 kg   KO"
s4 = "40344.6 lb5   "
s5 = "40344.6  "

cont_we_re = r"(?:\d+(?:\.\d+)? ?lbs?|(?<!lb.*)(?!.*lb)\d+(?:\.\d+)?(?: kg)?|\d+(?:\.\d+)? ?kg(*SKIP)(*FAIL))"


for s in [s1, s2, s3, s4, s5]:
    all_val = [i.group().strip() for i in regex.finditer(cont_we_re, s, regex.IGNORECASE)]
    print(all_val)

Output

['40344.6 lbs']
['40344.6lbs']
['18300 kg']
['40344.6 lb']
['40344.6']

Python demo

The fourth bird

This regex relies on nested if-then-else tests, which are supported by the regex package from the PyPI repository. It first looks ahead to see whether the input string contains a number followed by kg. If it does, it matches the number, kg, and the optional characters that may follow (\.?s?), though I believe this should be reversed according to standard English usage (for example, 10kgs. and not 10kg.s). If it cannot find this match (here comes the else part), it looks for a number followed by lb and, if successful, matches that. Finally, if that also fails, it just looks for a number. It is perhaps not the most efficient way, but it works up to a point.

Test case s6 shows that even though the kg amount follows a lb amount, the kg amount is selected anyway. Test case s7 shows that numbers followed by cbf, for example, are ignored.

import regex as re

cont_we_re = r"""
        (?(?=.*?[0-9,.]{2,}\s*kg)
            .*?(?P<VAL>[0-9,.]{2,}\s*kg\.?s?)
            |
            .*?(?(?=.*?[0-9,.]{2,}\s*lb)
                (?P<VAL>[0-9,.]{2,}\s*lb\.?s?)
                |
                (?(?=[0-9,.]{2,}\s?(cbf|cd3|cbm|m3|m[\\\>\?et]?|f3|ft3))(*SKIP)(*FAIL)|(?P<VAL>[0-9,.]{2,}))
            )
        )
        """

rex = re.compile(cont_we_re, flags=re.X|re.I)

s1 = "18300 kg 40344.6 lbs 25000 m3"
s2 = "18300kg 40344.6lbs 25000m3"
s3 = "18300 kg   KO"
s4 = "40344.6 lb5   "
s5 = "40344.6  "
s6 = "40344.6  128 LB.S 19kg"
s7 = "101.99 cbf  128"

vals = []
for s in [s1, s2, s3, s4, s5, s6, s7]:
    m = rex.search(s)
    vals.append(m['VAL'])
print(vals)

Prints:

['18300 kg', '18300kg', '18300 kg', '40344.6 lb', '40344.6', '19kg', '128']

UPDATE

I just realized that pounds (LB) are to take precedence over kilograms (KG), in which case the regex should be:

cont_we_re = r"""
        (?(?=.*?[0-9,.]{2,}\s*lb)
            .*?(?P<VAL>[0-9,.]{2,}\s*lb\.?s?)
            |
            .*?(?(?=.*?[0-9,.]{2,}\s*kg)
                (?P<VAL>[0-9,.]{2,}\s*kg\.?s?)
                |
                (?(?=[0-9,.]{2,}\s?(cbf|cd3|cbm|m3|m[\\\>\?et]?|f3|ft3))(*SKIP)(*FAIL)|(?P<VAL>[0-9,.]{2,}))
            )
        )
        """

and the results:

['40344.6 lbs', '40344.6lbs', '18300 kg', '40344.6 lb', '40344.6', '128 LB.S', '128']
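
For comparison, the same lb, then kg, then bare-number cascade can be sketched with the standard-library re module by trying three patterns in order instead of nesting conditionals. This is an alternative technique, not the regex above: the name first_weight is hypothetical, and the (?![0-9,.]) guard is an assumption added so that a number followed by a volume unit cannot be matched in shortened form by backtracking (the job (*SKIP)(*FAIL) does in the regex version).

```python
import re

NUM = r"[0-9,.]{2,}"
VOLUME = r"(?:cbf|cd3|cbm|m3|m[\\\>\?et]?|f3|ft3)"

LB_RE = re.compile(NUM + r"\s*lb\.?s?", re.IGNORECASE)
KG_RE = re.compile(NUM + r"\s*kg\.?s?", re.IGNORECASE)
# Bare number: must be a maximal run of number characters and must not be
# followed by a volume unit.
BARE_RE = re.compile(NUM + r"(?![0-9,.])(?!\s?" + VOLUME + r")", re.IGNORECASE)

def first_weight(s):
    """Hypothetical helper: first lb match, else first kg match, else bare number."""
    for rx in (LB_RE, KG_RE, BARE_RE):
        m = rx.search(s)
        if m:
            return m.group()
    return None

samples = [
    "18300 kg 40344.6 lbs 25000 m3",
    "18300kg 40344.6lbs 25000m3",
    "18300 kg   KO",
    "40344.6 lb5   ",
    "40344.6  ",
    "40344.6  128 LB.S 19kg",
    "101.99 cbf  128",
]
print([first_weight(s) for s in samples])
```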
Booboo