I have a string like this

str = 'name: phil age : 23 range: 33, 45 address: "main ave US"' 

to be tokenized as

['name: phil', 'age : 23', 'range: 33, 45', 'address: "main ave US"']
riteshtch
rdp
  • This isn't really a pure regex operation. What language are you using? Also, what is the source of this data? It almost looks like JSON, and if it were you would just use a JSON parser. – Tim Biegeleisen May 13 '16 at 05:12
  • Your expected output doesn't even match your description. This is **not** what regex is meant to do. You should use a dedicated lexer/parser. – Amit May 13 '16 at 05:16
  • It is not JSON. The source is a raw string entered as it is in the input box. – rdp May 13 '16 at 05:17
  • @dilip which programming language are you using? – riteshtch May 13 '16 at 05:19
  • The problem as it seems that the string is not following any useful patterns as such e.g. discrepancy between `name:` and `age :`. – AKS May 13 '16 at 05:20

2 Answers

Sample string 1

>>> import re
>>> str = 'name: phil age : 23 range: 33, 45 address: "main ave US"' 
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"']

Sample string 2

>>> str = 'name: phil age : 23 range: 33, 45 address: "main ave US" abcd : xyz' 
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"', 'abcd : xyz']

Sample string 3

>>> str = 'name: phil age : 23 range: 33, 45'
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45']

To trim the leading and trailing spaces of each match you can use this:

>>> [m.strip() for m in re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)]
['name: phil', 'age : 23', 'range: 33, 45']

Regex used is: \w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))
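
To make the pieces of that regex easier to follow, here is the same pattern written with re.VERBOSE and a comment on each part. This is only an annotated sketch of the pattern above; the names PAIR, text and tokens are illustrative and not part of the original answer.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be commented.
PAIR = re.compile(r'''
    \w+ \s* : \s*          # key, optional spaces, colon, optional spaces
    (?:
        "[^"]*"            # either a double-quoted value (may contain colons)
      |
        .*?                # or the shortest run of characters ...
        (?= \w+ \s* : \s*  # ... that is followed by the next key and colon
          | $ )            # or by the end of the string
    )
''', re.VERBOSE)

text = 'name: phil age : 23 range: 33, 45 address: "main ave US"'
tokens = [m.strip() for m in PAIR.findall(text)]
print(tokens)
# ['name: phil', 'age : 23', 'range: 33, 45', 'address: "main ave US"']
```

The lookahead is what ends each unquoted value: it checks that the next key is coming without consuming it, so the following match can start there.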


Edge case (leading words that do not belong to any key: value pair):

>>> str='word1 word2 name: phil age : 23 range: 33, 45'
>>> list(map(lambda x:x.strip() if ':' in x else list(map(lambda s:s.strip(), x.split())), re.findall(r'\w+\s*:?\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))?' , str)))
[['word1', 'word2'], 'name: phil', 'age : 23', 'range: 33, 45']

Once you have the above structure, you can flatten the list using any one of the answers given here
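
A minimal sketch of one way to do that flattening, assuming the result is nested only one level deep, as in the output above. The helper name flatten is hypothetical and not taken from any linked answer.

```python
def flatten(items):
    """Expand one level of nested lists, leaving plain strings untouched."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(item)   # bare words were returned as a sub-list
        else:
            flat.append(item)   # key: value tokens are plain strings
    return flat

nested = [['word1', 'word2'], 'name: phil', 'age : 23', 'range: 33, 45']
print(flatten(nested))
# ['word1', 'word2', 'name: phil', 'age : 23', 'range: 33, 45']
```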

riteshtch
  • I would have accepted this answer, but as you mentioned, it does not work for all the cases, e.g str = 'name: phil age : 23 range: 33, 45' Thanks for your efforts though – rdp May 13 '16 at 05:30
  • @dilip edited the ans with examples and a new regex .. have a look now – riteshtch May 13 '16 at 05:54
  • Perfect. This works. Thanks. If possible can you also look into one corner case str = 'word1 word2 name: phil age : 23 range: 33, 45' => ['word1', 'word2', 'name: phil ', 'age : 23 ', 'range: 33, 45']. But you answered as per the question and I am accepting it. Brilliant job. Just would be glad if you can handle the above edge case as well. – rdp May 13 '16 at 05:54
  • +1 but you can improve the pattern with a few small modifications. 1) You can shorten the lookahead to just `(?=\w+\s*:|$)` (removed the `\s*` after the colon). 2) You don't need the `"[^"]*"|` part at all. 3) Add `\s*` to the lookahead to get rid of the trailing whitespace. Basically it should look like this: `\w+\s*:\s*.*?(?=\s*\w+\s*:|$)`. _Update to work with OP's new requirement:_ `\w+(?:\s*:\s*.*?(?=\s*\w+\s*:|$))?`. – Aran-Fey May 13 '16 at 06:02
  • @dilip welcome :) sure .. I'm amidst something .. will try to include that and post in sometime – riteshtch May 13 '16 at 06:11
  • @Rawing if the value is double quoted and has a colon, it would break, hence in that case `"[^"]*"` would be necessary; regarding `\s*` removal in the positive look-ahead and adding \s* at the start for trailing spaces is a good point – riteshtch May 13 '16 at 06:15
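
The trade-off discussed in these comments can be checked directly. The sketch below compares the answer's pattern, the simplified pattern from the comment, and a third pattern that keeps the quoted alternative while moving `\s*` into the lookahead. That third pattern is my reading of the two comments combined, not something posted in the thread, and the sample string with a colon inside quotes is made up for illustration.

```python
import re

original   = r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))'   # from the answer
simplified = r'\w+\s*:\s*.*?(?=\s*\w+\s*:|$)'               # from the comment
combined   = r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\s*\w+\s*:|$))'   # assumption: both ideas merged

# Hypothetical input whose quoted value contains a colon.
sample = 'name: phil note: "see: below" age : 23'

print(re.findall(original, sample))
# ['name: phil ', 'note: "see: below"', 'age : 23']
print(re.findall(simplified, sample))
# ['name: phil', 'note: "', 'see: below"', 'age : 23']
print(re.findall(combined, sample))
# ['name: phil', 'note: "see: below"', 'age : 23']
```

Without the `"[^"]*"` alternative, the colon inside the quotes is taken for a new key and the value is split in two. With `\s*` in the lookahead, the trailing space is left out of each match, so no separate strip() is needed.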
This regex should be pretty stable. It only looks for a key name followed by a colon, and uses that both as the start of a match and, via a positive lookahead, as the end of the previous match (the next key itself is not consumed).

Depending on how you want to further process it, you can either use the simple variant:

\w+\s*:.*?(?=(?:\w+\s*:)|$)

This will match the entire key/value pair including all spaces.

Check this regex out on regex101.com


If you're going to split the pairs at the colon anyway, e.g. to store them in a dictionary, you could just as well use this slightly modified regex. With its two capture groups, re.findall returns a (key, value) tuple for each pair, with leading and trailing spaces already stripped:

(\w+)\s*:\s*(.*?)\s*(?=(?:\w+\s*:)|$)

Check this regex out on regex101.com


Here's a Python example of how to use both regexes:

import re

pattern1 = r'\w+\s*:.*?(?=(?:\w+\s*:)|$)'
pattern2 = r'(\w+)\s*:\s*(.*?)\s*(?=(?:\w+\s*:)|$)'
data = 'name: phil age : 23 range: 33, 45 address: "main ave US"' 

print('Pattern 1:', re.findall(pattern1, data))
print('Pattern 2:', re.findall(pattern2, data))

Output:

Pattern 1: ['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"']
Pattern 2: [('name', 'phil'), ('age', '23'), ('range', '33, 45'), ('address', '"main ave US"')]

See this code running on ideone.com
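
Since storing the pairs in a dictionary was mentioned above, here is a short sketch of that step. It assumes keys are unique (a repeated key would overwrite the earlier value) and keeps every value as a string. The name pairs is illustrative.

```python
import re

pattern2 = r'(\w+)\s*:\s*(.*?)\s*(?=(?:\w+\s*:)|$)'
data = 'name: phil age : 23 range: 33, 45 address: "main ave US"'

# re.findall yields (key, value) tuples, which dict() accepts directly.
pairs = dict(re.findall(pattern2, data))

print(pairs['range'])                 # 33, 45
print(pairs['address'].strip('"'))    # main ave US
```

Note that the double quotes stay part of the captured value, so they have to be stripped separately if they are not wanted.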

Byte Commander