I am using Python 3.8

I have a string that looks like this:

'1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'

I want to parse it into a list of tuples like this:

[(1, 'New Bathroom'),
 (2, 'New Kitchen'),
 (3, 'New Garden'),
 (4, 'Caribbean Holiday')
]

This is what I have managed to come up with so far, but it looks ugly. Is there a more succinct way of matching?

"^([0-9]{1,2}){1}\-[aA-zZ\s+]+"

It would be useful if you could explain the improved matching logic too.

Homunculus Reticulli

5 Answers

You can try this:

import re

s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
matches = re.findall(r"(\d+)-(.*?)(?= \d|$)", s)
results = [(int(n), m) for n, m in matches]
  • (\d+) matches a sequence of digits.
  • (.*?) matches a sequence of arbitrary characters in a non-greedy manner.
  • Finally, positive lookahead (?= \d|$) checks if what follows is either a space and a digit or the end of the string.
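
To see why the non-greedy quantifier matters here, the following minimal sketch compares the pattern above with a greedy variant. The variable names lazy and greedy are only illustrative, and the sample string is the one from the question.

```python
import re

s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'

# Non-greedy body: each match stops as soon as the lookahead sees
# " <digit>" or the end of the string.
lazy = re.findall(r"(\d+)-(.*?)(?= \d|$)", s)

# Greedy body: .* first consumes the rest of the string, where "$"
# already satisfies the lookahead, so everything ends up in one match.
greedy = re.findall(r"(\d+)-(.*)(?= \d|$)", s)

print(lazy)
print(greedy)
```

With the greedy variant, the whole remainder of the string lands in the second group of a single match, which is why the lazy form is used.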
bb1

Your pattern ^([0-9]{1,2}){1}\-[aA-zZ\s+]+ starts with an anchor ^ which limits the matching to the start of the string.

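The {1} quantifier is redundant and can be omitted. The range in the character class is also not the same as [a-zA-Z]: A-z additionally matches the characters [ \ ] ^ _ ` that sit between Z and a in ASCII, and the + inside the class is a literal plus sign rather than a quantifier.

Because \s is part of that single character class, the class can also match only spaces or newlines, so a string such as "4- " (nothing but whitespace after the hyphen) would match as well.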
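
The points above can be checked with a short sketch. It uses the pattern from the question unchanged; the variable names and the "4-   " test string are only illustrative.

```python
import re

# Pattern from the question, unchanged.
original = r"^([0-9]{1,2}){1}\-[aA-zZ\s+]+"

# [A-z] covers ASCII 65-122, which includes the six punctuation
# characters located between "Z" and "a".
extra = [c for c in "[\\]^_`" if re.fullmatch(r"[A-z]", c)]
print(extra)

# Whitespace alone satisfies the character class after the hyphen.
whitespace_only_matches = bool(re.match(original, "4-   "))
print(whitespace_only_matches)

# The ^ anchor restricts matching to the start of the string,
# so only the first item is found.
s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
found = [m.group(0) for m in re.finditer(original, s)]
print(found)
```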


You can use a pattern with two capture groups and call re.findall, which returns the capture group values as tuples. Ending each match on a-zA-Z characters prevents matching whitespace only.

\b([0-9]{1,2})-([a-zA-Z]+(?:\s+[a-zA-Z]+)*)\b

The pattern matches:

  • \b A word boundary to prevent a partial match
  • ([0-9]{1,2}) Capture group 1, match 1-2 digits
  • - Match a hyphen
  • ( Capture group 2
    • [a-zA-Z]+ Match 1+ chars a-zA-Z
    • (?:\s+[a-zA-Z]+)* Optionally repeat 1+ whitespace chars and 1+ chars a-zA-Z
  • ) Close group 2
  • \b A word boundary

For example

import re

s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
pattern = r'\b([0-9]{1,2})-([a-zA-Z]+(?:\s+[a-zA-Z]+)*)\b'
print(re.findall(pattern, s))

Output

[('1', 'New Bathroom'), ('2', 'New Kitchen'), ('3', 'New Garden'), ('4', 'Caribbean Holiday')]
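
re.findall returns the numbers as strings here, while the question asks for integers. A minimal sketch of the extra conversion step, reusing the same pattern (the name result is only illustrative):

```python
import re

s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
pattern = r'\b([0-9]{1,2})-([a-zA-Z]+(?:\s+[a-zA-Z]+)*)\b'

# Convert the first capture group of every match to int.
result = [(int(num), name) for num, name in re.findall(pattern, s)]
print(result)
```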
The fourth bird

You can use re with a list comprehension:

import re
s = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
# Raw string avoids invalid-escape warnings; each match is split on the hyphen.
r = [(int((k:=i.split('-'))[0]), k[1]) for i in re.findall(r'\d+-\w+\s\w+', s)]
print(r)

Output:

[(1, 'New Bathroom'), (2, 'New Kitchen'), (3, 'New Garden'), (4, 'Caribbean Holiday')]
Ajax1234
import re

txt = '1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday'
m = re.findall(r'(\d+)-([A-Za-z ]+)', txt) # the pattern contains two capture groups (number)(text)
out = [(int(v[0]), v[1].strip()) for v in m] # handle the capture groups (number,text) and make the list of tuples
print(out)

Prints:

[(1, 'New Bathroom'), (2, 'New Kitchen'), (3, 'New Garden'), (4, 'Caribbean Holiday')]
Алексей Р

Just use two expressions, one to find an item, one to separate it:

import re
 
pattern = re.compile(r'\d+-.+?(?=\s+\d+-|$)')  # \d+ in the lookahead so multi-digit item numbers also end a chunk
item_pattern = re.compile(r'(\d+)-(.+)')
 
text = "1-New Bathroom 2-New Kitchen 3-New Garden 4-Caribbean Holiday"
 
result = [item.groups()
          for chunk in pattern.findall(text)
          for item in [item_pattern.search(chunk)]]
 
print(result)

This yields

[('1', 'New Bathroom'), ('2', 'New Kitchen'), ('3', 'New Garden'), ('4', 'Caribbean Holiday')]

Jan