1

I am trying to extract measurements from a file using Python. I want to extract them with specification words. For example:

Width 3.5 in
Weight 10 kg

I used the following code:

p = re.compile('\b?:Length|Width|Height|Weight (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H.P.)\b')
print(p.findall(text))

However, it only outputs the first word (just "Height" or "Length") and completely misses the rest. Is there something I should fix in the above regular expression?

UPDATE: For some reason, an online regex tester and my IDE give me completely different results for the same pattern. The tester matches everything, whereas this code:

expression = r"""\b
            (?:
              [lL]ength\ +(?P<Length>\d+(?:\.\d+)?|\d+-\d+\/\d+)\ +(?:in|ft|cm|m)|
              [wW]idth\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
              [wW]eight\ +(?P<Weight>\d+(?:\.\d+)?|\d+-\d)\ +(?:oz|lb|g|kg)|
              Electrical\ +(?P<Electrical>[^ ]+)\ +(?:VAC|Hz|[aA]mps)
            )
            \b
    """

print(re.findall(expression, text, flags=re.X | re.MULTILINE | re.I))

returns [('17-13/16', '', '', '')] for the same input.

Is there something I should update?

poisonedivy
  • There appears to be a space between each pattern in your regex statement. Does this space also appear in the text? (Because the space is (likely) being matched). – S3DEV Jul 15 '20 at 20:55
  • The `\b` should not be optional, you are missing `kg` at the end and the alternation at the start should be within a group `\b(?:Length|Width|Height|Weight) (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)\b` https://regex101.com/r/TySeb7/1 – The fourth bird Jul 15 '20 at 20:57
  • The first word in the sample text is "Width", not "Height" or "Length". Please provide a [mre]. – martineau Jul 15 '20 at 21:00
  • I think it is helpful if you provide (a part of) the real text you are trying to search and the output you expect based on the example input. – Ronald Jul 15 '20 at 21:01
  • `'\b'` != `r'\b'`. To match a literal dot as in H.P. you need to escape it, and if you use a word boundary after `\.`, it will require a word char immediately on the right. – Wiktor Stribiżew Jul 15 '20 at 21:03

2 Answers

2

There are a few issues with the pattern:

  • You cannot put the quantifier ? after the word boundary \b
  • The alternatives Length|Width etc. should be inside a group
  • Add kg to the last alternation
  • Escape the dots to match them literally
  • Assert a whitespace boundary at the end with (?!\S) instead of \b. H.P. is one of the options, and \b after its final dot does not match when a space follows, for example
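
The grouping point explains why only the keyword came back. A minimal sketch, using deliberately simplified patterns that are my own illustration rather than the pattern from the question:

```python
import re

text = "Width 3.5 in"

# Without a group, | splits the whole pattern into three branches:
# "Length", or "Width", or "Weight \d+ in". Only the last branch
# includes a number and a unit.
ungrouped = re.findall(r"Length|Width|Weight \d+ in", text)

# With a non-capturing group, the alternation covers only the keywords,
# and the number and unit are required after any of them.
grouped = re.findall(r"(?:Length|Width|Weight) [\d.]+ in", text)

print(ungrouped)
print(grouped)
```

In the ungrouped version the bare keyword is a complete match on its own, which is consistent with the output described in the question.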

For example

\b(?:Length|Width|Height|Weight) (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)
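
In Python this could be used roughly as follows. The sample text is hypothetical, modeled on the two lines in the question, and the pattern is written as a raw string:

```python
import re

# Raw string literals so that \b reaches the regex engine as a word boundary.
pattern = re.compile(
    r"\b(?:Length|Width|Height|Weight) "
    r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) "
    r"(?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)"
)

# Hypothetical sample text; the last line has no unit and should be skipped.
text = "Width 3.5 in\nWeight 10 kg\nHeight 12"

# The pattern has no capturing groups, so findall returns the full matches.
matches = pattern.findall(text)
print(matches)
```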

Also note Wiktor Stribiżew's comment about \b: in a normal Python string '\b' is a backspace character, while in a raw string r'\b' reaches the regex engine as a word boundary.
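
A small check of that difference, as a sketch with a made-up one-line input:

```python
import re

# In a normal string literal, "\b" is the single backspace character (0x08).
backspace_is_one_char = "\b" == "\x08" and len("\b") == 1

# In a raw string, r"\b" is two characters, a backslash and "b", which the
# regex engine interprets as a word boundary.
raw_is_two_chars = len(r"\b") == 2

text = "Width 3.5 in"
no_raw = re.findall("\bWidth", text)     # searches for backspace + "Width"
with_raw = re.findall(r"\bWidth", text)  # searches for "Width" at a word boundary

print(no_raw, with_raw)
```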

The fourth bird
  • 1
    Thank you so much! How would I make the Length / Width / Height words optional at the beginning? – poisonedivy Jul 15 '20 at 22:09
  • @poisonedivy You can make the alternation at the beginning optional, including the trailing space if you want `\b(?:(?:Length|Width|Height|Weight) )?(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)` https://regex101.com/r/bw1eLF/1 – The fourth bird Jul 15 '20 at 22:11
2

Consider using the following regular expression, which ties the format of the values and the units of measurement to the element being matched.

\b
(?:
  Length\ +(?P<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
  Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
  Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)
)
\b

I've written this with the x ("extended") flag, which makes the engine ignore unescaped whitespace in the pattern, to make it easier to read. For that reason I needed to escape the space characters. (Alternatively, I could have put each in a character class.)

As seen, "Length" and "Width" require the value to be an integer or a float and the units to be any of "in", "ft", "cm" or "m", whereas "Weight" requires the value to be an integer and the units to be any of "oz", "lb", "g" or "kg". It could of course be extended in the obvious way.
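
A sketch of how this might be run in Python, assuming the named groups are spelled (?P<name>...) as Python's re module requires. The sample text is hypothetical:

```python
import re

# Python spells named groups (?P<name>...); (?<name>...) is rejected by re.
pattern = re.compile(
    r"""
    \b
    (?:
      Length\ +(?P<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)
    )
    \b
    """,
    re.VERBOSE,
)

# Hypothetical sample text, not taken from the question's real file.
text = "Length 12.5 cm Width 3 in Weight 10 kg"

# lastgroup names the one capture group that took part in each match.
found = [(m.lastgroup, m.group(m.lastgroup)) for m in pattern.finditer(text)]
print(found)
```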

Python's regex engine performs the following operations.

\b                 : assert word boundary
(?:                : begin non-capture group
  Length\ +        : match 'Length' then 1+ spaces
  (?P<Length>      : begin named capture group 'Length'
    \d+            : match 1+ digits
    (?:\.\d+)?     : optionally match '.' then 1+ digits
  )                : close named capture group
  \ +              : match 1+ spaces
  (?:in|ft|cm|m)   : match 'in', 'ft', 'cm' or 'm' in a
                     non-capture group
|                  : or
  Width\ +         : similar to above
  (?P<Width>       :       ""
    \d+            :       ""
    (?:\.\d+)?     :       ""
  )                :       ""
  \ +              :       ""
  (?:in|ft|cm|m)   :       ""
|                  : or
  Weight\ +        : similar to above
  (?P<Weight>\d+)  : match 1+ digits in capture group 'Weight'
  \ +              : similar to above
  (?:oz|lb|g|kg)   :       ""
)                  : end non-capture group
\b                 : assert word boundary

To allow "Length" to be expressed in fractional amounts, change

(?P<Length>
  \d+
  (?:\.\d+)?
)

to

(?P<Length>
  \d+
  (?:\.\d+)?
|               : or
  \d+-\d+\/\d+  : match 1+ digits, '-', 1+ digits, '/', 1+ digits
)
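
As a rough Python illustration of the fractional alternative, here is the "Length" branch on its own. The input string is made up, with one mixed fraction and one decimal:

```python
import re

length = re.compile(
    r"""
    \bLength\ +
    (?P<Length>
      \d+(?:\.\d+)?    # integer or decimal, e.g. 4.5
    |
      \d+-\d+/\d+      # mixed fraction, e.g. 17-13/16
    )
    \ +(?:in|ft|cm|m)\b
    """,
    re.VERBOSE,
)

# Hypothetical input for illustration only.
text = "Length 17-13/16 in. Length 4.5 cm"

values = [m.group("Length") for m in length.finditer(text)]
print(values)
```

The first alternative matches only "17" of the fraction and then fails at the hyphen, so the engine backtracks and tries the second alternative.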

To add an element to the alternation for "Electrical", append a pipe (|) at the end of the "Weight" row and insert the following before the last right parenthesis.

  Electrical\ +      : match 'Electrical' then 1+ spaces
  (?P<Electrical>    : begin capture group 'Electrical'
    [^ ]+            : match 1+ characters other than spaces
  )                  : close named capture group
  \ +                : match 1+ spaces
  (?:VAC|Hz|[aA]mps) : match 'VAC', 'Hz' or 'amps' in a
                       non-capture group

Here I've made the electrical value merely a string of characters other than spaces, because values for 'Hz' (e.g., 50-60) have a different format than those for 'VAC' and 'amps'. That could be fine-tuned if necessary.
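
A sketch of the "Electrical" branch by itself in Python, run against the first part of the text posted in the comments below:

```python
import re

electrical = re.compile(
    r"""
    \b
    Electrical\ +
    (?P<Electrical>[^ ]+)
    \ +(?:VAC|Hz|[aA]mps)
    \b
    """,
    re.VERBOSE,
)

text = "Electrical 120 VAC 50/60 Hz 6.8 amps"

m = electrical.search(text)
value = m.group("Electrical")
print(m.group(0), "->", value)
```

Note that only the value directly after the word "Electrical" is captured. "50/60 Hz" and "6.8 amps" are not preceded by that keyword, so this branch does not match them; whether that is acceptable depends on the data.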

Cary Swoveland
  • Nice ++ I think you can even save a char with `c?m` and `k?g` :-) – The fourth bird Jul 15 '20 at 22:14
  • 1
    @Thefourthbird, I'll keep that in mind if I need to use this solution at Code Golf. – Cary Swoveland Jul 15 '20 at 22:21
  • I tried extracting using the above expression from the text below, but it missed all the specifications. Is there any fix I should add? `Electrical 120 VAC 50/60 Hz 6.8 amps Dimensions Length 17-13/16 in. Width 7-3/4 in. Height 9-1/4 in. Weight 9 Ibs., 14 oz. Motor 1 HP.` – poisonedivy Jul 15 '20 at 22:57
  • 1
    poison, I've done an edit to address your question. It should help you add new elements to the alternation. See if you can do that for `"Motor 1.5 hp"`, `"Motor 1 HP"`, etc. Try it at one of my links. – Cary Swoveland Jul 15 '20 at 23:49
  • Thank you so much, @CarySwoveland! This really helped. For some reason, the tester and my IDE give me completely different matches. I updated the post with more details. Is there something I should change? – poisonedivy Jul 16 '20 at 02:03
  • poison, I don’t know Python so I can’t help you with that. Perhaps another reader can. @Thefourthbird, can you help? – Cary Swoveland Jul 16 '20 at 02:16
  • @CarySwoveland I ran a test, but I get the expected values `https://ideone.com/1OHasI` as re.findall returns the values of the capturing groups. I also don't see the value `17-13/16` that the OP gets in return in the example data. – The fourth bird Jul 16 '20 at 07:48
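
For readers hitting the same tuple-shaped output: the sketch below shows how re.findall reports capture groups and how re.finditer can be used instead. It uses a cut-down two-branch pattern, and extending "Width" to accept fractions is my assumption, since the "Width" alternative as posted accepts only integers or decimals and so cannot match "Width 7-3/4 in.".

```python
import re

pattern = re.compile(
    r"""
    \b
    (?:
      Length\ +(?P<Length>\d+(?:\.\d+)?|\d+-\d+/\d+)\ +(?:in|ft|cm|m)|
      Width\ +(?P<Width>\d+(?:\.\d+)?|\d+-\d+/\d+)\ +(?:in|ft|cm|m)
    )
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Excerpt of the text posted in the comments above.
text = "Length 17-13/16 in. Width 7-3/4 in."

# findall returns one tuple per match with a slot for every capture group;
# groups that did not take part in that match appear as empty strings.
tuples = pattern.findall(text)

# finditer yields match objects, so only the group that matched is reported.
pairs = [(m.lastgroup, m.group(m.lastgroup)) for m in pattern.finditer(text)]

print(tuples)
print(pairs)
```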