I would advise against using regular expressions for this task, because the language you want to parse is not regular.
You have a string containing multiple key-value pairs. The best way to parse it is not to match patterns against it, but to tokenize it properly.
There is a module in the Python standard library, called shlex, that mimics the parsing done by POSIX shells and provides a lexer implementation that can easily be customized to your needs.

    from shlex import shlex

    def parse_kv_pairs(text, item_sep=",", value_sep="="):
        """Parse key-value pairs from a shell-like text."""
        # initialize a lexer, in POSIX mode (to properly handle escaping)
        lexer = shlex(text, posix=True)
        # set ',' as whitespace for the lexer
        # (the lexer will use this character to separate words)
        lexer.whitespace = item_sep
        # include '=' as a word character
        # (this is done so that the lexer returns a list of key-value pairs)
        # (if your option key or value contains any unquoted special character,
        # you will need to add it here)
        lexer.wordchars += value_sep
        # then we separate option keys and values to build the resulting dictionary
        # (maxsplit is required to make sure that '=' in value will not be a problem)
        return dict(word.split(value_sep, maxsplit=1) for word in lexer)

(split has a maxsplit argument, which is much cleaner to use than splitting, slicing, and joining.)
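As an alternative to extending wordchars one character at a time, the lexer's whitespace_split attribute makes it break words on the separator only, so unquoted characters such as / or - stay inside a token. The following is a minimal sketch under that assumption; the name parse_kv_pairs_split is hypothetical and not part of shlex.

```python
from shlex import shlex

def parse_kv_pairs_split(text, item_sep=",", value_sep="="):
    """Variant sketch: split on item_sep only (hypothetical helper name)."""
    lexer = shlex(text, posix=True)
    # only ',' separates words; spaces are NOT treated as whitespace here
    lexer.whitespace = item_sep
    # split words on whitespace characters only, so '=', '/', '-', '.'
    # and similar unquoted characters remain part of the current word
    lexer.whitespace_split = True
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

print(parse_kv_pairs_split('path=/usr/local/bin,opt=a-b.c'))
print(parse_kv_pairs_split('name="x, y",n=1'))
```

Two caveats with this variant: an unquoted # still starts a comment (the lexer's default commenters), and spaces around the separator are kept as part of the key or value, so you may want to strip them yourself.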
Example run:

    parse_kv_pairs(
        'key1=value1,key2=\'value2,still_value2,not_key1="not_value1"\''
    )

Output:

    {'key1': 'value1', 'key2': 'value2,still_value2,not_key1="not_value1"'}
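To see why this works, it can help to look at the words the lexer emits before they are split into keys and values. This sketch repeats the lexer setup from the function above on the same input and simply collects the tokens; the variable names are illustrative.

```python
from shlex import shlex

text = 'key1=value1,key2=\'value2,still_value2,not_key1="not_value1"\''

lexer = shlex(text, posix=True)
lexer.whitespace = ","   # ',' separates words
lexer.wordchars += "="   # '=' is part of a word

# iterating over the lexer yields one word per key-value pair;
# the single-quoted section is kept together and its quotes are removed
tokens = list(lexer)
print(tokens)
# ['key1=value1', 'key2=value2,still_value2,not_key1="not_value1"']
```

The commas inside the quoted value never reach the separator logic, which is what a plain split on ',' cannot guarantee.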
The reason I usually stick with shlex rather than regular expressions (which are faster in this case) is that it gives you fewer surprises, especially if you need to allow more possible inputs later on. I have never found a way to properly parse such key-value pairs with regular expressions: there will always be inputs (e.g. A="B=\"1,2,3\"") that trick the engine.
If you do not care about such inputs (or, put another way, if you can ensure that your input follows the definition of a regular language), regular expressions are perfectly fine.
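To make the trade-off concrete, here is a sketch that runs the escaped-quote input mentioned above through a simple regular expression and through the shlex-based function. The pattern is a hypothetical naive attempt written for illustration, not a recommended one; a more elaborate pattern may cope with this particular input.

```python
import re
from shlex import shlex

def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Same shlex-based parser as above, repeated so the sketch is self-contained."""
    lexer = shlex(text, posix=True)
    lexer.whitespace = item_sep
    lexer.wordchars += value_sep
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

tricky = r'A="B=\"1,2,3\""'

# naive pattern (illustrative assumption): key, '=', then either a
# double-quoted value or anything up to the next comma
naive = re.findall(r'(\w+)=("[^"]*"|[^,]*)', tricky)
print(naive)   # the quoted value is cut off at the first escaped quote

result = parse_kv_pairs(tricky)
print(result)  # {'A': 'B="1,2,3"'}
```

The regular expression stops at the first escaped quote because it has no notion of escaping, whereas the POSIX-mode lexer unescapes the inner quotes and returns the whole value.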