I have a string like this:
str = 'name: phil age : 23 range: 33, 45 address: "main ave US"'
that should be tokenized as:
['name: phil', 'age : 23', 'range: 33, 45', 'address: "main ave US"']
Sample string 1
>>> import re
>>> str = 'name: phil age : 23 range: 33, 45 address: "main ave US"'
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"']
Sample string 2
>>> str = 'name: phil age : 23 range: 33, 45 address: "main ave US" abcd : xyz'
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"', 'abcd : xyz']
Sample string 3
>>> str = 'name: phil age : 23 range: 33, 45'
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45']
To trim the leading and trailing spaces of each match you can use this:
>>> list(map(lambda x:x.strip(), re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)))
['name: phil', 'age : 23', 'range: 33, 45']
Regex used is: \w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))
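To make the pieces of that regex easier to follow, here is a sketch of the same pattern written with re.VERBOSE and one comment per part. The names PAIR, text and pairs are only illustrative and are not part of the original answer.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be annotated.
# Whitespace inside the pattern is ignored in this mode.
PAIR = re.compile(r'''
    \w+ \s* : \s*                  # key, optional spaces, colon, optional spaces
    (?:
        "[^"]*"                    # either a double-quoted value
      |
        .*?                        # or as little text as possible
        (?= \w+ \s* : \s* | $ )    # up to (not including) the next key or the end
    )
''', re.VERBOSE)

text = 'name: phil age : 23 range: 33, 45 address: "main ave US"'

# Strip the trailing space that the lazy match leaves on unquoted values.
pairs = [m.strip() for m in PAIR.findall(text)]
print(pairs)
# ['name: phil', 'age : 23', 'range: 33, 45', 'address: "main ave US"']
```

The list comprehension does the same job as the list(map(lambda ...)) call shown earlier.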
Edge case: leading words that do not belong to any key are collected into a separate list:
>>> str='word1 word2 name: phil age : 23 range: 33, 45'
>>> list(map(lambda x:x.strip() if ':' in x else list(map(lambda s:s.strip(), x.split())), re.findall(r'\w+\s*:?\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))?' , str)))
[['word1', 'word2'], 'name: phil', 'age : 23', 'range: 33, 45']
Once you have the above structure, you can flatten the list using any one of the answers given here.
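As one possible way to do that flattening, a minimal sketch for the one-level nesting produced above could look like this. The helper name flatten_one_level is hypothetical, and it assumes that nested items are always plain lists.

```python
def flatten_one_level(items):
    """Expand nested lists in place; leave plain strings untouched."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(item)   # splice the inner list into the result
        else:
            flat.append(item)   # keep 'key: value' strings as they are
    return flat

mixed = [['word1', 'word2'], 'name: phil', 'age : 23', 'range: 33, 45']
print(flatten_one_level(mixed))
# ['word1', 'word2', 'name: phil', 'age : 23', 'range: 33, 45']
```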
This regex should be pretty stable. It only looks for a key name followed by a colon: that marks the start of a match and, via a positive lookahead, also the end of the previous match without being included in it.
Depending on how you want to further process it, you can either use the simple variant:
\w+\s*:.*?(?=(?:\w+\s*:)|$)
This will match the entire key/value pair including all spaces.
Check this regex out on regex101.com
If you're going to split the pairs at the colon anyway, e.g. to store them in a dictionary, you could just as well use this slightly modified regex, which returns a tuple (key, value) for each pair, with leading and trailing spaces already stripped:
(\w+)\s*:\s*(.*?)\s*(?=(?:\w+\s*:)|$)
Check this regex out on regex101.com
Here's a Python example of how to use both regexes:
import re
pattern1 = r'\w+\s*:.*?(?=(?:\w+\s*:)|$)'
pattern2 = r'(\w+)\s*:\s*(.*?)\s*(?=(?:\w+\s*:)|$)'
data = 'name: phil age : 23 range: 33, 45 address: "main ave US"'
print('Pattern 1:', re.findall(pattern1, data))
print('Pattern 2:', re.findall(pattern2, data))
Output:
Pattern 1: ['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"']
Pattern 2: [('name', 'phil'), ('age', '23'), ('range', '33, 45'), ('address', '"main ave US"')]
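Since the second pattern already yields (key, value) tuples, they can be passed straight to dict(). The sketch below assumes that keys are unique and that surrounding double quotes should be removed from values; the names record and unquoted are only illustrative.

```python
import re

pattern2 = r'(\w+)\s*:\s*(.*?)\s*(?=(?:\w+\s*:)|$)'
data = 'name: phil age : 23 range: 33, 45 address: "main ave US"'

# findall returns a list of (key, value) tuples, which dict() accepts directly.
record = dict(re.findall(pattern2, data))

# Optional: drop the double quotes around quoted values.
unquoted = {key: value.strip('"') for key, value in record.items()}
print(unquoted)
# {'name': 'phil', 'age': '23', 'range': '33, 45', 'address': 'main ave US'}
```

Note that this pattern does not treat quotes specially, so a quoted value that itself contains a word followed by a colon would be split at that point.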