Consider using the following regular expression, which ties the format of the values and the units of measurement to the element being matched.
\b
(?:
Length\ +(?<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Width\ +(?<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Weight\ +(?<Weight>\d+)\ +(?:oz|lb|g|kg)
)
\b
I've written this with the x ("extended") flag, which ignores whitespace in the pattern, to make it easier to read. For that reason I needed to escape the space characters. (Alternatively, I could have put each in a character class.)
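As a minimal sketch of why the spaces need protecting, assuming Python's built-in re module with re.VERBOSE as the extended flag (the sample strings are made up for illustration):

```python
import re

# Escaped space: survives verbose mode and matches a literal space.
escaped = re.compile(r"Weight\ +\d+", re.VERBOSE)

# Space inside a character class: also survives verbose mode.
bracketed = re.compile(r"Weight[ ]+\d+", re.VERBOSE)

# Bare space: dropped in verbose mode, so the '+' applies to the 't' instead.
unprotected = re.compile(r"Weight +\d+", re.VERBOSE)

sample = "Weight 40 lb"  # hypothetical input
print(bool(escaped.search(sample)))
print(bool(bracketed.search(sample)))
print(bool(unprotected.search(sample)))
```

The first two patterns find "Weight 40"; the third does not, because the space it was meant to match is no longer part of the pattern.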
As seen, "Length" and "Width" require the value to be an integer or a float and the units to be any of "in", "ft", "cm" or "m", whereas "Weight" requires the value to be an integer and the units to be any of "oz", "lb", "g" or "kg". It could of course be extended in the obvious way.
Start your engine!
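The regex engine performs the following operations. (Note that Python's built-in re module writes named groups as (?P<Length>...); the (?<Length>...) spelling used here is the PCRE-style form, which the third-party regex module also accepts.)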
\b : assert word boundary
(?: : begin non-capture group
Length\ + : match 'Length' then 1+ spaces
(?<Length> : begin named capture group 'Length'
\d+ : match 1+ digits
(?:\.\d+)? : optionally match '.' then 1+ digits
) : close named capture group
\ + : match 1+ spaces
(?:in|ft|cm|m) : match 'in', 'ft', 'cm' or 'm' in a
non-capture group
| : or
Width\ + : similar to above
(?<Width> : ""
\d+ : ""
(?:\.\d+)? : ""
) : ""
\ + : ""
(?:in|ft|cm|m) : ""
| : ""
Weight\ + : ""
(?<Weight>\d+) : match 1+ digits in capture group 'Weight'
\ + : similar to above
(?:oz|lb|g|kg) : ""
) : end non-capture group
\b : assert word boundary
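The pattern explained above could be tried in Python roughly as follows. This is a sketch that assumes the built-in re module, so the named groups are written (?P<name>...) and the x flag becomes re.VERBOSE. The extract helper and the sample string are hypothetical, added only for illustration.

```python
import re

# Same pattern as above, with named groups in the (?P<name>...) spelling
# that Python's built-in re module requires.
PATTERN = re.compile(
    r"""
    \b
    (?:
      Length\ +(?P<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)
    )
    \b
    """,
    re.VERBOSE,
)


def extract(text):
    """Hypothetical helper: map each matched element name to its value string."""
    found = {}
    for m in PATTERN.finditer(text):
        # Only one named group takes part in each match; skip the others.
        found.update({k: v for k, v in m.groupdict().items() if v is not None})
    return found


sample = "Length 12.5 in Width 3 cm Weight 40 lb"  # made-up input
print(extract(sample))
```

Because the units are tied to the element, a string such as "Weight 4.5 lb" or "Length 5 kg" produces no match at all.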
To allow "Length" to be expressed in fractional amounts (e.g., 12-3/4), change
(?<Length>
\d+
(?:\.\d+)?
)
to
(?<Length>
\d+
(?:\.\d+)?
| : or
\d+-\d+\/\d+ : match 1+ digits, '-' 1+ digits, '/', 1+ digits
)
Fractional values
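A sketch of the fractional variant in Python, again assuming the built-in re module ((?P<name>...) and re.VERBOSE). Only the "Length" alternative is shown to keep it short, and the to_fraction helper is a hypothetical addition for turning the captured text into a number.

```python
import re
from fractions import Fraction

LENGTH = re.compile(
    r"""
    \b
    Length\ +
    (?P<Length>
      \d+(?:\.\d+)?      # integer or float
      |                  # or
      \d+-\d+/\d+        # 1+ digits, '-', 1+ digits, '/', 1+ digits
    )
    \ +(?:in|ft|cm|m)
    \b
    """,
    re.VERBOSE,
)


def to_fraction(value):
    """Hypothetical helper: convert '12-3/4', '12.5' or '12' to a Fraction."""
    if "-" in value:
        whole, frac = value.split("-")
        return int(whole) + Fraction(frac)
    return Fraction(value)


for sample in ("Length 12-3/4 in", "Length 2.5 ft"):  # made-up inputs
    m = LENGTH.search(sample)
    print(m.group("Length"), to_fraction(m.group("Length")))
```

For "12-3/4" the first alternative matches "12" but cannot continue past the hyphen, so the engine backtracks and succeeds with the second alternative.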
To add an element to the alternation for "Electrical", append a pipe (|) at the end of the "Weight" row and insert the following before the last right parenthesis.
Electrical\ + : match 'Electrical' then 1+ spaces
(?<Electrical> : begin capture group 'Electrical'
[^ ]+ : match 1+ characters other than spaces
) : close named capture group
\ + : match 1+ spaces
(?:VAC|Hz|[aA]mps) : match 'VAC', 'Hz', 'amps' or 'Amps' in a
non-capture group
Here I've made the electrical value merely a string of characters other than spaces because values for 'Hz' (e.g., 50-60) have a different format from those for 'VAC' and 'amps'. That could be fine-tuned if necessary.
Add Electrical
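With the "Electrical" alternative added, the whole pattern might look like this in Python. As before, this is a sketch assuming the built-in re module ((?P<name>...) and re.VERBOSE), and the sample strings are invented for illustration.

```python
import re

PATTERN = re.compile(
    r"""
    \b
    (?:
      Length\ +(?P<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)|
      Electrical\ +(?P<Electrical>[^ ]+)\ +(?:VAC|Hz|[aA]mps)
    )
    \b
    """,
    re.VERBOSE,
)

# Made-up inputs; the last one uses a unit the pattern does not list.
for sample in ("Electrical 50-60 Hz", "Electrical 120 VAC", "Electrical 15 volts"):
    m = PATTERN.search(sample)
    print(sample, "->", m.group("Electrical") if m else None)
```

The space inside [^ ] is kept in verbose mode because it sits in a character class, so it needs no escaping there.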