Python regex, split arguments, ignore commas in quotes

Question

Let's say I have a line containing arguments separated by commas:

'0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"'

I want a regex in Python that splits this sequence into a list of items (shown one item per line for clarity):

[
'0xe1b04048', 
'FUTEX_WAIT', 
'0', 
'NULL',
'"Hey, World, how, are, you"'
]

I tried to build a regex with a negative lookahead that could at least handle one comma inside the quotes, planning to extend it afterwards, but I did not manage even that. Calling re.split(r",\s(?!\".*,\s.*\")",args)

on

'0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World"'

results in

[
'0xe1b04048', 
'FUTEX_WAIT', 
'0', 
'NULL , "Hey', 
'World"'
]

Why not use `csv`? See [Python csv string to array](https://stackoverflow.com/questions/3305926/python-csv-string-to-array) for an example of usage. — Wiktor Stribiżew, Jul 19 '18 at 14:22
You should probably look into the [`csv` module](https://docs.python.org/3/library/csv.html) instead. — Kevin J. Chase, Jul 19 '18 at 14:22
Note that using `csv` will be a little tricky, as the intent is to remove the whitespace following the commas as well. — chepner, Jul 19 '18 at 14:24

score 3 · Accepted Answer · answered Jul 19 '18 at 14:27

You can use the csv module with skipinitialspace=True

Ex:

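import csv

# filename is the path to a text file holding one argument line per row
with open(filename, "r", newline="") as infile:
    # skipinitialspace=True drops the whitespace that follows each comma
    reader = csv.reader(infile, delimiter=",", skipinitialspace=True)
    for line in reader:
        # remove leading/trailing single quotes from each field
        print([i.strip("'") for i in line])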

Output (the trailing space in 'NULL ' is kept, and csv consumes the double quotes):

['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL ', 'Hey, World, how, are, you']
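If the line is already in memory rather than in a file, the same idea can be sketched as below. This is only an illustration: `split_args` is a hypothetical helper name, not part of the csv module. It relies on csv.reader accepting any iterable of strings, and it strips each field afterwards because skipinitialspace only removes whitespace after a delimiter, not before one.

```python
import csv

line = '0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"'

def split_args(text):
    """Split one comma-separated line, keeping commas inside double quotes.

    Hypothetical helper for illustration only.
    """
    # csv.reader takes any iterable of strings, so wrap the line in a list.
    reader = csv.reader([text], delimiter=",", skipinitialspace=True)
    row = next(reader)
    # skipinitialspace handles whitespace after a comma; strip() also
    # removes whitespace before one (as in 'NULL ').
    return [field.strip() for field in row]

print(split_args(line))
# ['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL', 'Hey, World, how, are, you']
```

Note that csv removes the surrounding double quotes, so the last item differs from the output requested in the question, which keeps them.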

score 2 · Answer 2 · answered Jul 19 '18 at 14:37

You probably should use csv for this. However, if you prefer a plain Python solution (no regex either, though), you could try this: split by " first, then split all the even-indexed parts by ,. Regardless of whether the list starts with a string element or not, the contents of the quoted strings will always be in the odd positions.

>>> s = '"start", 0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you", not, a, string, "another, string"'
>>> s.split('"')
['',
 'start',
 ', 0xe1b04048, FUTEX_WAIT, 0, NULL , ',
 'Hey, World, how, are, you',
 ', not, a, string, ',
 'another, string',
 '']

>>> [x.strip() for i, w in enumerate(s.split('"')) 
...            for x in (['"%s"'%w] if i%2 else w.split(", ")) if x]
['"start"',
 '0xe1b04048',
 'FUTEX_WAIT',
 '0',
 'NULL',
 '"Hey, World, how, are, you"',
 'not',
 'a',
 'string',
 '"another, string"']

This is, of course, assuming that there are no nested or escaped quotes.
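The same quote-splitting idea could be wrapped in a small function, as in the sketch below. The name `split_outside_quotes` and the ValueError check for an odd number of quotes are assumptions added for illustration; they are not part of the answer above. Like the one-liner, it assumes there are no nested or escaped quotes.

```python
def split_outside_quotes(text):
    """Split on commas that are not inside double quotes.

    Hypothetical helper; assumes no nested or escaped quotes.
    """
    chunks = text.split('"')
    # n quotes produce n + 1 chunks, so balanced quotes give an odd count.
    if len(chunks) % 2 == 0:
        raise ValueError("unbalanced double quotes")
    parts = []
    for index, chunk in enumerate(chunks):
        if index % 2:
            # Odd chunks are the contents of quoted strings: keep them whole
            # and put the quotes back.
            parts.append('"%s"' % chunk)
        else:
            # Even chunks are outside quotes: split on commas, drop blanks.
            parts.extend(p.strip() for p in chunk.split(",") if p.strip())
    return parts

args = '0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"'
print(split_outside_quotes(args))
# ['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL', '"Hey, World, how, are, you"']
```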

score 0 · Answer 3 · answered Jul 19 '18 at 14:48

(Posting this as a second answer, as the approach is very different from the first.)

If you really want to use regular expressions for this, you could try the pattern ".+?"|[^", ]+ which just looks for all parts that are either enclosed in ", or contain neither " nor , nor a space.

>>> import re
>>> s = '"start", 0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you", not, a, string, "another, string"'
>>> p = r'".+?"|[^", ]+'
>>> re.findall(p, s)
['"start"',
 '0xe1b04048',
 'FUTEX_WAIT',
 '0',
 'NULL',
 '"Hey, World, how, are, you"',
 'not',
 'a',
 'string',
 '"another, string"']

Again, this will probably break down if there are nested or escaped quotes, and all things considered using csv is probably the better idea.
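Since the question asked for re.split specifically, one possible sketch is a lookahead that only accepts a comma when an even number of double quotes remains after it, which means the comma sits outside any quoted string. The names `OUTSIDE_QUOTES_COMMA` and `split_with_lookahead` are hypothetical, and the pattern assumes balanced quotes with no escaped quotes inside them. The lookahead rescans the rest of the line for every comma, so it may be slow on very long inputs.

```python
import re

# Match a comma (with optional surrounding whitespace) only if the rest of
# the string contains an even number of double quotes, i.e. the comma is
# not inside a quoted string. Assumes balanced, unescaped quotes.
OUTSIDE_QUOTES_COMMA = re.compile(r'\s*,\s*(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_with_lookahead(text):
    """Hypothetical helper: split on commas outside double quotes."""
    return OUTSIDE_QUOTES_COMMA.split(text)

args = '0xe1b04048, FUTEX_WAIT, 0, NULL , "Hey, World, how, are, you"'
print(split_with_lookahead(args))
# ['0xe1b04048', 'FUTEX_WAIT', '0', 'NULL', '"Hey, World, how, are, you"']
```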

tobias_k


This might be a little off-topic from the original question, but could you explain the difference between `".+?"` and `".*"`? I thought the first means "match as many characters as you can between the quotes, but at least one" (the meaning of `.+`), "zero or one time" (the meaning of `?`), which seems the same as `".*"`, which I interpret as "match zero or more characters between the quotes". I tried it, and the `".*"` approach does not work; I guess that is because there is another `"` at the end of the string, so it matches the largest string it can. But why does this not happen with `".+?"`? – Smarty77 Jul 19 '18 at 15:09
@Smarty77 `.+?` is not the same as `(.+)?`, which would indeed be `.*`. The `?` makes it non-greedy. `.*` or `.+` would match everything from the first opening `"` until the last closing `"`, whereas `.*?` will match only up to the _next_ `"`, i.e. the individual strings. – tobias_k Jul 19 '18 at 15:20
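The difference between the greedy and non-greedy quantifiers can be seen in a short experiment. This is a minimal sketch using a shortened sample string; the variable names are chosen only for illustration.

```python
import re

s = '"start", 0xe1b04048, "another, string"'

# Greedy: .+ grabs as much as possible, so the match runs from the first
# opening quote to the last closing quote in the string.
greedy = re.findall(r'".+"', s)

# Non-greedy: .+? grabs as little as possible, so each match stops at the
# next closing quote.
lazy = re.findall(r'".+?"', s)

print(greedy)  # ['"start", 0xe1b04048, "another, string"']
print(lazy)    # ['"start"', '"another, string"']
```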
