
I am looking to parse patterns like "w x h x l" using regex, so basically the letters w,h,l (and others) with "x" in between. There could be text around the searched expression, and "w x h x l x l x h" would be valid as well.

I have tried the regular expression
(w|h|l|b)(\s*x\s*(w|h|l|b))+
but I don't understand why this doesn't work.

Examples (with Python's re.findall):
"The measurements are (w x h x l): 5x7x3cm" => [(w,h,l)]
"Measurement options are (wxhxl), (hxlxb): Some random stuff" => [(w,h,l),(h,l,b)]
"The measurements, in form wxhxl: 5x7x3cm" => [(w,h,l)]

anotherGatsby
schajan

2 Answers


You can use your pattern with non-capturing groups to extract each full match, then strip the whitespace from each match and split it on x to get the individual characters:

import re

texts = [
    "The measurements are (w x h x l): 5x7x3cm", # => [(w,h,l)]
    "Measurement options are (wxhxl), (hxlxb): Some random stuff", # => [(w,h,l),(h,l,b)]
    "The measurements, in form wxhxl: 5x7x3cm" # => [(w,h,l)] 
]
for text in texts:
    print( [tuple(''.join(x.split()).split('x')) for x in re.findall(r'\b[whlb](?:\s*x\s*[whlb])+\b', text)] )

See the Python demo. Output:

[('w', 'h', 'l')]
[('w', 'h', 'l'), ('h', 'l', 'b')]
[('w', 'h', 'l')]

The \b[whlb](?:\s*x\s*[whlb])+\b pattern matches

  • \b - word boundary
  • [whlb] - a w, h, l or b char
  • (?:\s*x\s*[whlb])+ - one or more repetitions of an x enclosed with zero or more whitespaces and then a w, h, l or b char
  • \b - word boundary
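
To illustrate why switching to non-capturing groups matters, here is a minimal sketch comparing the pattern from the question with a group-free variant. The short sample string and the variable names are made up for the demonstration. When a pattern contains capturing groups, `re.findall` returns the group values rather than the whole match, and a repeated group keeps only its last iteration:

```python
import re

text = "w x h x l"  # illustrative sample, not taken from the question

# Pattern from the question: every (...) is a capturing group.
capturing = r'(w|h|l|b)(\s*x\s*(w|h|l|b))+'
# re.findall returns one tuple of group values per match, and a repeated
# group only keeps the text from its last iteration, so "h" is lost.
captured = re.findall(capturing, text)
print(captured)  # [('w', ' x l', 'l')]

# With no capturing groups, re.findall returns the whole match instead.
non_capturing = r'[whlb](?:\s*x\s*[whlb])+'
whole = re.findall(non_capturing, text)
print(whole)  # ['w x h x l']
```

The whole-match string can then be post-processed in plain Python, which is what the split on x does.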
Wiktor Stribiżew
  • Thank you so much, this is exactly what I'm looking for. Could you tell me what was wrong with my approach? Your regex looks very similar to mine, but for some reason the outcomes seem completely different – schajan Mar 28 '22 at 10:29
  • @schajan Please see [re.findall behaves weird](https://stackoverflow.com/a/31915134/3832970). Also, see [How to capture multiple repeated groups?](https://stackoverflow.com/q/37003623/3832970) and [Capturing repeating subpatterns in Python regex](https://stackoverflow.com/q/9764930/3832970). With my approach, the whole matched string is returned, and then I split it with `x` while removing the spaces with a bit of Python code. – Wiktor Stribiżew Mar 28 '22 at 10:31
  • @schajan so, your regex was meant to extract multiple parts of match with a repeated capturing group, which is only supported with PyPi regex in Python, not `re`. The approach without using PyPi regex is the one I chose since you do not have to depend on the extra library. – Wiktor Stribiżew Mar 29 '22 at 08:04

If you can use the PyPI regex module, you can use a named capture group and read all of its captures:

\b(?<pat>[whlb])(?:\s*x\s*(?<pat>[whlb]))+\b 
  • \b A word boundary to prevent a partial word match
  • (?<pat>[whlb]) Group pat match one of w h l b
  • (?: Non capture group to repeat as a whole
    • \s*x\s*(?<pat>[whlb]) Match an x between optional whitespace chars and again named capture group pat
  • )+ Close the non capture group and repeat it 1+ times to match at least a single x
  • \b A word boundary

See a regex demo for the capture group values and a Python demo.

import regex

# Word boundaries added so the code matches the pattern explained above
pattern = r'\b(?<pat>[whlb])(?:\s*x\s*(?<pat>[whlb]))+\b'
s = ("The measurements are (w x h x l): 5x7x3cm\n"
     "The measurements, in form wxhxl: 5x7x3cm\n"
     "Measurement options are (wxhxl), (hxlxb): Some random stuff\n"
     "w x h x l x l x h")

for m in regex.finditer(pattern, s):
    print(tuple(m.captures("pat")))

Output

('w', 'h', 'l')
('w', 'h', 'l')
('w', 'h', 'l')
('h', 'l', 'b')
('w', 'h', 'l', 'l', 'h')
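
For comparison, a similar per-match tuple can be sketched with the standard `re` module alone, in case installing the third-party `regex` module is not an option. This is only an assumption-laden sketch: the helper name `dimension_tuples` and the constants `TOKEN` and `RUN` are hypothetical, and instead of group captures it runs a second `re.findall` over each whole match to collect the letters.

```python
import re

TOKEN = r'[whlb]'  # same character class as in the pattern above
# Outer pattern finds each whole "w x h x l" run; it has no capturing groups.
RUN = re.compile(rf'\b{TOKEN}(?:\s*x\s*{TOKEN})+\b')

def dimension_tuples(text):
    """Hypothetical helper: return one tuple of letters per matched run."""
    # The separator "x" is not in TOKEN, so the inner findall skips it.
    return [tuple(re.findall(TOKEN, m.group())) for m in RUN.finditer(text)]

print(dimension_tuples("w x h x l x l x h"))
# [('w', 'h', 'l', 'l', 'h')]
```

Keeping the token in a single constant means the same sub-pattern is used both to find a run and to pick its parts back out.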
The fourth bird
  • This is very interesting, thank you! In particular it might help me to generalize to more than single characters. Could you tell me what the "?" in (?<pat>) means? – schajan Mar 28 '22 at 10:44
  • @schajan That `(?<pat>)` is the notation for a named capture group in the pattern. – The fourth bird Mar 28 '22 at 11:02