2

Let's say I have a line containing arguments separated by commas:

'0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"'

I want a regex in Python that splits this sequence into a list of items (shown one item per line for clarity):

[
'0xe1b04048', 
'FUTEX_WAIT', 
'0', 
'NULL',
'"Hey, World, how, are, you"'
]

I tried to write a regex with a negative lookahead that could handle at least one comma inside the quoted string, planning to extend it later, but I didn't manage even that. Calling re.split(r",\s(?!\".*,\s.*\")", args)

on

'0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World"'

results in

[
'0xe1b04048', 
'FUTEX_WAIT', 
'0', 
'NULL , "Hey', 
'World"'
]
Vasilis G.
Smarty77

3 Answers

3

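You can use the csv module with skipinitialspace=True, so that the space after each comma is skipped and the opening " is recognized as the start of a quoted field.

Example: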

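import csv

# filename is the path to a file holding one argument line per row
with open(filename, "r") as infile:
    reader = csv.reader(infile, delimiter=",", skipinitialspace=True)
    for line in reader:
        # strip the single quotes that wrap each line in the file
        print([i.strip("'") for i in line])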

Output:

['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL ', 'Hey, World, how, are, you']
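
If the line is already in memory as a string rather than in a file, the same idea might look like the sketch below. It relies on csv.reader accepting any iterable of strings; the variable names `line` and `fields` are only illustrative. Note that csv drops the surrounding double quotes, so the last item differs slightly from the output asked for in the question.

```python
import csv

line = '0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"'

# csv.reader accepts any iterable of strings, so a one-element list works.
reader = csv.reader([line], delimiter=",", skipinitialspace=True)

# skipinitialspace only removes whitespace *after* a delimiter,
# so strip each field to get rid of the trailing space in 'NULL '.
fields = [field.strip() for field in next(reader)]
print(fields)
# ['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL', 'Hey, World, how, are, you']
```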
Rakesh
2

You should probably use csv for this. However, if you prefer a solution based on plain string methods (no regex either), you could try this: split by " first, then split all the even-indexed parts by ,. Regardless of whether the list starts with a string element or not, the contents of the quoted strings will always be at the odd positions.

>>> s = '"start", 0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you", not, a, string, "another, string"'
>>> s.split('"')
['',
 'start',
 ', 0xe1b04048, FUTEX_WAIT, 0, NULL , ',
 'Hey, World, how, are, you',
 ', not, a, string, ',
 'another, string',
 '']

>>> [x.strip() for i, w in enumerate(s.split('"')) 
...            for x in (['"%s"'%w] if i%2 else w.split(", ")) if x]
['"start"',
 '0xe1b04048',
 'FUTEX_WAIT',
 '0',
 'NULL',
 '"Hey, World, how, are, you"',
 'not',
 'a',
 'string',
 '"another, string"']

This is, of course, assuming that there are no nested or escaped quotes.
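
For readability, the same quote-splitting idea could be wrapped in a small function, as in the sketch below. The name `split_args` is made up for illustration, and unlike the one-liner above it splits the unquoted parts on a bare comma and strips whitespace afterwards. It still assumes balanced quotes with no nesting or escaping.

```python
def split_args(line):
    """Split on commas, keeping double-quoted strings intact.

    Assumes quotes are balanced and never nested or escaped.
    """
    items = []
    for i, part in enumerate(line.split('"')):
        if i % 2:
            # Odd chunks were inside quotes: keep verbatim, restore the quotes.
            items.append('"%s"' % part)
        else:
            # Even chunks are outside quotes: split on commas, drop empties.
            items.extend(x.strip() for x in part.split(",") if x.strip())
    return items


result = split_args('0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"')
print(result)
# ['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL', '"Hey, World, how, are, you"']
```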

tobias_k
0

(Posting this as a second answer, as the approach is very different from the first.)

If you really want to use regular expressions for this, you could try this: ".+?"|[^", ]+ This just looks for all parts that are either enclosed in " or contain no ", comma, or space.

>>> import re
>>> s = '"start", 0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you", not, a, string, "another, string"'
>>> p = r'".+?"|[^", ]+'
>>> re.findall(p, s)
['"start"',
 '0xe1b04048',
 'FUTEX_WAIT',
 '0',
 'NULL',
 '"Hey, World, how, are, you"',
 'not',
 'a',
 'string',
 '"another, string"']

Again, this will probably break down if there are nested or escaped quotes, and all things considered, using csv is probably the better idea.
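
One more edge case worth knowing: because `.+?` needs at least one character, an empty quoted string `""` is not matched on its own and the match runs on to the next quote. A possible variant, not part of the answer above, is to use `"[^"]*"` for the quoted alternative. The names `ARG_PATTERN` and `find_args` in this sketch are hypothetical, and it still assumes no nested or escaped quotes.

```python
import re

# Quoted part: a quote, any run of non-quote characters (possibly empty), a quote.
# Unquoted part: a run of characters that are not a quote, comma, or space.
ARG_PATTERN = re.compile(r'"[^"]*"|[^", ]+')


def find_args(line):
    """Return all quoted strings and bare tokens found in line."""
    return ARG_PATTERN.findall(line)


print(find_args('0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"'))
# ['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL', '"Hey, World, how, are, you"']

print(find_args('a, "", "b, c"'))
# ['a', '""', '"b, c"']
```

As with the original pattern, an unquoted item that contains a space would come back as two separate tokens.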

tobias_k
  • This might be a little off-topic from the original question, but could you explain the difference between `".+?"` and `".*"`? I thought the former means "match as many characters as you can between the quotes, but at least one (the meaning of `.+`), zero or one time (the meaning of `?`)", which seems the same as `".*"`, which I interpret as "match zero or more characters between the quotes". I tried it and the `".*"` approach does not work; I guess that is because there is another " at the end of the string, and it matches the largest string it can. But why does this not happen with `".+?"`? – Smarty77 Jul 19 '18 at 15:09
  • 1
    @Smarty77 `.+?` is not the same as `(.+)?`, which would indeed be `.*`. The `?` makes it non-greedy. `.*` or `.+` would match everything from the first opening `"` until the last closing `"`, whereas `.*?` will match only up to the _next_ `"`, i.e. the individual strings. – tobias_k Jul 19 '18 at 15:20
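
The greedy versus non-greedy difference described in this comment can be seen in a short experiment. This is only an illustrative sketch; the sample string and the variable names are made up.

```python
import re

s = '"start", 0, "Hey, World"'

# Greedy: .* grabs as much as possible, so the match runs from the
# first opening quote to the very last closing quote.
greedy = re.findall(r'".*"', s)
print(greedy)  # ['"start", 0, "Hey, World"']

# Non-greedy: .+? grabs as little as possible, so each match stops
# at the next quote and the strings come out individually.
lazy = re.findall(r'".+?"', s)
print(lazy)    # ['"start"', '"Hey, World"']
```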