9

I am having trouble coding an 'elegant' parser for this requirement (one that does not look like a piece of C breakfast). The input is a string of key-value pairs, separated by ',' and joined by '='.

key1=value1,key2=value2

The part tripping me up is that values can be quoted ("), and inside the quotes a ',' does not end the value.

key1=value1,key2="value2,still_value2"

This last part makes it tricky to use split or re.split, so I ended up resorting to for i in range loops :(.

Can anyone demonstrate a clean way to do this?

It is OK to assume that quotes occur only in values, and that the input contains no whitespace or other non-alphanumeric characters.

Evan Benn

5 Answers

12

I would advise against using regular expressions for this task, because the language you want to parse is not regular.

You have a character string of multiple key value pairs. The best way to parse this is not to match patterns on it, but to properly tokenize it.

There is a module in the Python standard library, called shlex, that mimics the parsing done by POSIX shells, and that provides a lexer implementation that can easily be customized to your needs.

from shlex import shlex

def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Parse key-value pairs from a shell-like text."""
    # initialize a lexer, in POSIX mode (to properly handle escaping)
    lexer = shlex(text, posix=True)
    # set ',' as whitespace for the lexer
    # (the lexer will use this character to separate words)
    lexer.whitespace = item_sep
    # include '=' as a word character 
    # (this is done so that the lexer returns a list of key-value pairs)
    # (if your option key or value contains any unquoted special character, you will need to add it here)
    lexer.wordchars += value_sep
    # then we separate option keys and values to build the resulting dictionary
    # (maxsplit is required to make sure that '=' in value will not be a problem)
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

(split has a maxsplit argument, which is much cleaner to use than splitting, slicing, and joining.)

Example run:

parse_kv_pairs(
  'key1=value1,key2=\'value2,still_value2,not_key1="not_value1"\''
)

Output:

{'key1': 'value1', 'key2': 'value2,still_value2,not_key1="not_value1"'}

The reason I usually stick with shlex rather than regular expressions (which are faster in this case) is that it gives you fewer surprises, especially if you need to allow more possible inputs later on. I never found a way to properly parse such key-value pairs with regular expressions; there will always be inputs (e.g. A="B=\"1,2,3\"") that trick the engine.

If you do not care about such inputs (or, put another way, if you can ensure that your input follows the definition of a regular language), regular expressions are perfectly fine.
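As a variation on the function above (not part of the original answer), shlex also has a whitespace_split attribute. When it is set, the lexer splits only on its whitespace characters, so '=' and other unquoted punctuation such as '.' or '-' no longer need to be added to wordchars. A minimal sketch, assuming the same input format; the name parse_kv_pairs_ws is made up for illustration:

```python
from shlex import shlex


def parse_kv_pairs_ws(text, item_sep=",", value_sep="="):
    """Hypothetical variant: split only on item_sep, honouring quotes."""
    lexer = shlex(text, posix=True)
    # Treat ',' as the only "whitespace" character.
    lexer.whitespace = item_sep
    # Split tokens on whitespace only, so '=', '.', '-' etc. stay in the word.
    lexer.whitespace_split = True
    # Split each token on the first '=' only, so '=' may appear in values.
    return dict(word.split(value_sep, 1) for word in lexer)


print(parse_kv_pairs_ws('key1=value1,key2="value2,still_value2"'))
print(parse_kv_pairs_ws('A="B=\\"1,2,3\\""'))
```

In POSIX mode the surrounding quotes are stripped and backslash-escaped quotes inside double quotes are unescaped, so the second call handles the tricky input mentioned above.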

mkrieger1
pistache
  • 1
    I believe `shlex` is a solid production solution and this is a nice example of how to tune it to the problem at hand. However, this answer loses all elegance for me in its `return` statement: `split()` the same data twice and then `join()` to clean up after the excessive `split()`, just so you can use a dictionary comprehension? How about `return dict(word.split(value_sep, maxsplit=1) for word in lexer)`? – cdlane Aug 03 '16 at 17:28
  • Yes, this is way better. I forgot about the `maxsplit` argument when writing, and indeed made it way less elegant when adding support for `=` in values. Thanks for your advice, I edited the answer. – pistache Aug 03 '16 at 21:01
7

Using some regex magic from "Split a string, respect and preserve quotes", we can do:

import re

string = 'key1=value1,key2="value2,still_value2"'

key_value_pairs = re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', string)

for key_value_pair in key_value_pairs:
    # split on the first '=' only, so a value may itself contain '='
    key, value = key_value_pair.split("=", 1)

Per BioGeek's request, here is my attempt to guess, I mean interpret, the regex Janne Karila used. The pattern breaks strings on commas but respects double-quoted sections (potentially containing commas) in the process. It has two alternatives: runs of characters that don't involve quotes, and double-quoted runs of characters where a double quote finishes the run unless it is escaped with a backslash:

(?:              # parenthesis for alternation (|), not memory
[^\s,"]          # any 1 character except white space, comma or quote
|                # or
"(?:\\.|[^"])*"  # a quoted string containing 0 or more characters
                 # other than quotes (unless escaped)
)+               # one or more of the above
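To turn the matches into a dictionary, one possible follow-up (my own sketch, not from the linked answer) is to split each match on the first '=' and strip the surrounding double quotes from the value. The names PAIR_PATTERN and parse_pairs are made up for illustration:

```python
import re

# Same pattern as above: unquoted characters or a double-quoted run.
PAIR_PATTERN = re.compile(r'(?:[^\s,"]|"(?:\\.|[^"])*")+')


def parse_pairs(text):
    """Hypothetical helper: build a dict from 'k=v' matches."""
    result = {}
    for pair in PAIR_PATTERN.findall(text):
        # Split on the first '=' only.
        key, value = pair.split("=", 1)
        # Drop one pair of surrounding double quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        result[key] = value
    return result


print(parse_pairs('key1=value1,key2="value2,still_value2"'))
```

Note that this only strips the outer quotes; backslash escapes inside the value are left as they are.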
Community
cdlane
  • Can you add some explanation about how the regex works? – BioGeek Aug 03 '16 at 08:42
  • 1
    @BioGeek, I attempted per your request, let me know if I succeeded or not! – cdlane Aug 03 '16 at 09:10
  • I usually refrain from using re whenever I can, and I manage that in most of my tasks. Today I stumbled upon a network device log structure that uses spaces as the field delimiter and optional quoting for values, where some values contain spaces (and are quoted). I really liked and wanted to use the shlex approach in another answer, but that didn't work, and your regex does work indeed. Thanks, +1. – 0xc0de Apr 24 '20 at 10:23
3

I came up with this regular expression solution:

import re
match = re.findall(r'([^=]+)=(("[^"]+")|([^,]+)),?', 'key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4')

And this makes "match":

[('key1', 'value1', '', 'value1'), ('key2', 'value2', '', 'value2'), ('key3', '"value3,stillvalue3"', '"value3,stillvalue3"', ''), ('key4', 'value4', '', 'value4')]

Then you can make a for loop to get keys and values:

for m in match:
    key = m[0]
    value = m[1]
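If you would rather not index into the tuples, a similar pattern with named groups can build the dictionary directly and drop the quotes at the same time. This is a sketch under the question's assumptions (no escaped quotes inside values); the group names and the function name pairs_to_dict are my own:

```python
import re

# key, then '=', then either a double-quoted value or a run without commas
PAIR_RE = re.compile(r'(?P<key>[^=,]+)=(?:"(?P<quoted>[^"]*)"|(?P<plain>[^,]*))')


def pairs_to_dict(text):
    """Hypothetical helper: collect key/value pairs into a dict."""
    result = {}
    for m in PAIR_RE.finditer(text):
        # Exactly one of the two value groups takes part in each match.
        value = m.group("quoted")
        if value is None:
            value = m.group("plain")
        result[m.group("key")] = value
    return result


print(pairs_to_dict('key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4'))
```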
2

I'm not sure this avoids looking like a piece of C breakfast, or that it is particularly elegant :)

data = {}
original = 'key1=value1,key2="value2,still_value2"'
converted = ''

is_open = False
for c in original:
    if c == ',' and not is_open:
        c = '\n'
    elif c in ('"',"'"):
        is_open = not is_open
    converted += c

for item in converted.split('\n'):
    # split on the first '=' only, so a value may itself contain '='
    k, v = item.split('=', 1)
    data[k] = v
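The same idea can be wrapped in a small function that collects the pieces directly instead of going through a '\n' placeholder. This is only a sketch of that variation; the names split_outside_quotes and parse are made up, and it tracks a single quote character rather than both:

```python
def split_outside_quotes(text, sep=",", quote='"'):
    """Split text on sep, ignoring any sep found between quote characters."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == quote:
            # Toggle the quoted state; the quote itself is kept in the output.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse(text):
    # Skip empty items and split each item on the first '=' only.
    return dict(item.split("=", 1) for item in split_outside_quotes(text) if item)


print(parse('key1=value1,key2="value2,still_value2"'))
```

As in the code above, the quotes stay part of the value; strip them afterwards if you do not want them.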
Sergey Gornostaev
1

Based on several other answers, I came up with the following solution:

import re
import itertools

data = 'key1=value1,key2="value2,still_value2"'

# Based on Alan Moore's answer on http://stackoverflow.com/questions/2785755/how-to-split-but-ignore-separators-in-quoted-strings-in-python
def split_on_non_quoted_equals(string):
    return re.split('''=(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string)
def split_on_non_quoted_comma(string):
    return re.split(''',(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string)

split1 = split_on_non_quoted_equals(data)
split2 = map(lambda x: split_on_non_quoted_comma(x), split1)

# 'Unpack' the sublists in to a single list. Based on Alex Martelli's answer on http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
flattened = [item for sublist in split2 for item in sublist]

# Convert alternating elements of a list into keys and values of a dictionary. Based on Sven Marnach's answer on http://stackoverflow.com/questions/6900955/python-convert-list-to-dictionary
# Note: zip_longest is the Python 3 name; on Python 2 use itertools.izip_longest.
d = dict(itertools.zip_longest(*[iter(flattened)] * 2, fillvalue=""))

The resulting d is the following dictionary:

{'key1': 'value1', 'key2': '"value2,still_value2"'}
Kurt Peek