Parse string starting with pattern starting with digit and ending white space BEFORE next digit

Question

I am using Python 3.8

I have a string that looks like this:

'1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'

I want to parse it into a list of tuples like this:

[(1, 'New Bathroom'),
 (2, 'New Kitchen'),
 (3, 'New Garden'),
 (4, 'Caribbean Holiday')
]

This is what I have managed to come up with so far, but it looks ugly. Is there a more succinct way of matching?

"^([0-9]{1,2}){1}\-[aA-zZ\s+]+"

It would be useful if you could explain the improved matching logic too.

bb1 · Answer 1 · 2021-08-29T15:53:28.160

2

You can try this:

import re
s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
matches = re.findall(r"(\d+)-(.*?)(?= \d|$)", s)
results = [(int(n), m) for n, m in matches]

(\d+) matches a sequence of digits.
(.*?) matches a sequence of arbitrary characters in a non-greedy manner.
Finally, positive lookahead (?= \d|$) checks if what follows is either a space and a digit or the end of the string.
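
To see how the lookahead decides where each item ends, here is a minimal sketch. The function names parse_items and parse_items_strict are hypothetical, and the stricter lookahead (?= \d+-|$) is my own variation rather than part of the answer above; it may help if a label can itself contain a space followed by a digit.

```python
import re

def parse_items(s):
    # Pattern from the answer: stop before " <digit>" or at the end of the string.
    matches = re.findall(r"(\d+)-(.*?)(?= \d|$)", s)
    return [(int(n), label) for n, label in matches]

def parse_items_strict(s):
    # Hypothetical variation: only stop before " <digits>-" or at the end.
    matches = re.findall(r"(\d+)-(.*?)(?= \d+-|$)", s)
    return [(int(n), label) for n, label in matches]

# Multi-digit numbers work with both versions.
print(parse_items('1-New Bathroom 2-New Kitchen 10-Loft Conversion'))

# A digit inside a label cuts the item short with the original lookahead...
print(parse_items('3-Phase 2 Works 4-Roof'))
# ...while the stricter lookahead keeps the label whole.
print(parse_items_strict('3-Phase 2 Works 4-Roof'))
```

Whether the stricter form is needed depends on whether labels in the real data can contain digits.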

edited Aug 29 '21 at 15:53

answered Aug 29 '21 at 15:46

bb1


score 2 · Accepted Answer · answered Aug 29 '21 at 16:51

Your pattern ^([0-9]{1,2}){1}\-[aA-zZ\s+]+ starts with an anchor ^ which limits the matching to the start of the string.

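You can omit {1}, since a group matches exactly once by default. The ranges in the character class [aA-zZ] are not the same as [a-zA-Z]: the range A-z also matches the ASCII characters that sit between Z and a, which are [ \ ] ^ _ and the backtick.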
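
A quick way to check the difference between the two ranges is to test them against the characters between Z and a. This is only an illustrative sketch; the variable names are my own.

```python
import re

# The ASCII characters that sit between 'Z' (90) and 'a' (97).
between = ''.join(chr(c) for c in range(ord('Z') + 1, ord('a')))
print(between)  # [\]^_`

# The A-z range accepts all of them; a-zA-Z accepts none.
loose = re.findall(r'[A-z]', between)
strict = re.findall(r'[a-zA-Z]', between)
print(loose)
print(strict)
```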

Putting \s inside that single character class also means the class can match only spaces or newlines, so a number and hyphen followed by nothing but whitespace (for example 4- and a space) would match as well.
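
Both points about the original pattern can be reproduced with a short sketch (the variable names are hypothetical):

```python
import re

original = r"^([0-9]{1,2}){1}\-[aA-zZ\s+]+"
s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'

# The ^ anchor means only the item at the start of the string can match,
# and findall returns just the single capture group.
print(re.findall(original, s))  # ['1']

# Whitespace alone satisfies the character class, so there is no real label here.
m = re.match(original, '4-   ')
print(repr(m.group(0)))  # '4-   '
```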

You can use a pattern with 2 capture groups, and use re.findall to return the capture group values in tuples, and end the match with a-zA-Z chars to not match spaces only.

\b([0-9]{1,2})-([a-zA-Z]+(?:\s+[a-zA-Z]+)*)\b

The pattern matches:

\b A word boundary to prevent a partial match
([0-9]{1,2}) Capture group 1, match 1-2 digits
- Match a hyphen
( Capture group 2
- [a-zA-Z]+ Match 1+ chars a-zA-Z
- (?:\s+[a-zA-Z]+)* Optionally repeat 1+ whitespace chars and 1+ chars a-zA-Z
) Close group 2
\b A word boundary
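
The breakdown above can also be kept next to the pattern itself by using re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is just an alternative way of writing the same regex, sketched under the assumption that you prefer a commented pattern; the name verbose_pattern is my own.

```python
import re

verbose_pattern = re.compile(r"""
    \b                       # word boundary to prevent a partial match
    ([0-9]{1,2})             # group 1: one or two digits
    -                        # a literal hyphen
    (                        # group 2: the label
        [a-zA-Z]+            #   first word
        (?:\s+[a-zA-Z]+)*    #   optionally, more words separated by whitespace
    )
    \b                       # word boundary
""", re.VERBOSE)

s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
print(verbose_pattern.findall(s))
```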


For example

import re

s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
pattern = r'\b([0-9]{1,2})-([a-zA-Z]+(?:\s+[a-zA-Z]+)*)\b'
print(re.findall(pattern, s))

Output

[('1', 'New Bathroom'), ('2', 'New Kitchen'), ('3', 'New Garden'), ('4', 'Caribbean Holiday')]
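
Note that re.findall returns the numbers as strings. If you want integers as in the question, one possible way is to convert them in a list comprehension (the names result, n and label are my own):

```python
import re

s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
pattern = r'\b([0-9]{1,2})-([a-zA-Z]+(?:\s+[a-zA-Z]+)*)\b'

# Convert the first capture group of each match to int.
result = [(int(n), label) for n, label in re.findall(pattern, s)]
print(result)
```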

score 1 · Answer 3 · answered Aug 29 '21 at 15:44

You can use re with a list comprehension:

import re
s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
r = [(int((k:=i.split('-'))[0]), k[1]) for i in re.findall(r'\d+\-\w+\s\w+', s)]

Output:

[(1, 'New Bathroom'), (2, 'New Kitchen'), (3, 'New Garden'), (4, 'Caribbean Holiday')]
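
One thing to be aware of: \w+\s\w+ matches exactly two words, so this works for the sample string but may silently drop part of a longer label. A small sketch with a made-up input (the three-word label is my own example, not from the question):

```python
import re

# Hypothetical input where the first label has three words.
s2 = '1-New Garden Shed 2-New Kitchen'

# Same pattern as above: only the first two words of each label are captured.
found = re.findall(r'\d+\-\w+\s\w+', s2)
print(found)
```

If labels can vary in length, a pattern that stops at the next number instead of counting words is probably a safer choice.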

score 1 · Answer 4 · answered Aug 29 '21 at 15:49

import re

txt = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
m = re.findall(r'(\d+)-([A-Za-z ]+)', txt) # the pattern contains two capture groups (number)(text)
out = [(int(v[0]), v[1].strip()) for v in m] # handle the capture groups (number,text) and make the list of tuples
print(out)

Prints:

[(1, 'New Bathroom'), (2, 'New Kitchen'), (3, 'New Garden'), (4, 'Caribbean Holiday')]

score 1 · Answer 5 · answered Aug 29 '21 at 17:07

Just use two expressions, one to find an item, one to separate it:

import re
 
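# Lazily grab one "number-text" chunk, stopping before the next "number-" or at the end
pattern = re.compile(r'\d+-.+?(?=\s+\d+-|$)')
# Split a chunk into its number and its text
item_pattern = re.compile(r'(\d+)-(.+)')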
 
text = "1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday"
 
result = [item.groups()
          for chunk in pattern.findall(text)
          for item in [item_pattern.search(chunk)]]
 
print(result)

This yields

[('1', 'New Bathroom'), ('2', 'New Kitchen'), ('3', 'New Garden'), ('4', 'Caribbean Holiday')]

See a demo on ideone.com.
