0

I want to take a string like this:

enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''

and convert it into a JSON-like format:

{'enabled': 'false', 'script': 'var name=\'Bob\'\\n ', 'index': '0', 'value': ''}

but I cannot for the life of me figure out a regex or a combination of splitting the string that will produce the result.

The values can contain any special characters, and single quotes and backslashes in them will always be escaped.

Is there any way to get the regex in Python to stop after finding the first match?

For example, this:

import re
re.findall('[a-zA-Z0-9]+=\'.*\'', line)

will match the entire string instead and won't stop at

['stripPrefix=\'false\'', ....]

like I would like it to.

Jason

4 Answers

1

First of all, I assume there is a mistake in your input string: the quote before "Bob" should be escaped.

If my assumption is correct, I would use a regex like this:

>>> import re
>>> line = r"""enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''"""
>>> re.findall(r"([a-zA-Z]*)='((?:[^'\\]|\\.)*)'\s*", line)
[('enabled', 'false'), ('script', "var name=\\'Bob\\'\\\\n "), ('index', '0'), ('value', '')]
  • [^'\\] matches any character except a quote or a backslash
  • \\. matches a backslash followed by any one character
  • ([^'\\]|\\.) matches either of the previous cases
  • (?:[^'\\]|\\.) does the same but doesn't capture the match into the result (see https://docs.python.org/2.7/library/re.html)
  • (?:[^'\\]|\\.)* repeats it any number of times
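The pieces listed above can also be written as one commented pattern. This is only a minimal sketch: it assumes keys consist of letters and digits, and the names `PAIR_RE` and `parse_pairs` are illustrative, not part of the original answer.

```python
import re

# The same pattern, spelled out with re.VERBOSE so each part can carry a comment.
PAIR_RE = re.compile(r"""
    ([a-zA-Z0-9]+)      # key: letters and digits (an assumption about the input)
    ='                  # equals sign and opening quote
    (                   # captured value:
      (?: [^'\\]        #   any character except a quote or a backslash
        | \\.           #   or a backslash followed by any one character
      )*                #   repeated any number of times
    )
    '                   # closing quote
""", re.VERBOSE)


def parse_pairs(line):
    """Return (key, raw_value) tuples; escapes inside values are left untouched."""
    return PAIR_RE.findall(line)


line = r"enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''"
pairs = parse_pairs(line)
print(pairs)
```

Because `\\.` consumes a backslash together with the character after it, an escaped quote can never be mistaken for the closing quote, which is what keeps the match from running to the end of the line.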
Dmitry Ermolov
0
>>> import re
>>> line = "enabled='false' script='var name=\\'Bob\\'\\n \\\\' index='0' value=''"
>>> print line
enabled='false' script='var name=\'Bob\'\n \\' index='0' value=''
>>> groups = re.findall(r"([a-zA-Z0-9]+)='((?:\\.|[^\'])*)'", line)
>>> for name, value in groups:
...     print name
...     print value
... 
enabled
false
script
var name=\'Bob\'\n \\
index
0
value

>>> import json
>>> print json.dumps(dict(groups))
{"index": "0", "enabled": "false", "value": "", "script": "var name=\\'Bob\\'\\n \\\\"}

The regex is based on this answer.

Note that Python strings can use either single or double quotes. If your string literal contains one of those, use the other. If it contains both, use triple quotes: """. This way you don't have to awkwardly escape the quotes. The r prefix denotes a raw string and also lets you cut down on escaping: in this case it allows me to write \\ instead of \\\\!
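To make the point about quoting concrete, here is a small sketch that writes one pattern three ways and checks that the literals are identical. The variable names are illustrative, and the character class is written as `[^']`, which is equivalent to the `[^\']` used above.

```python
import re

# One pattern, three spellings. Only the amount of escaping differs.
plain = "([a-zA-Z0-9]+)='((?:\\\\.|[^'])*)'"        # ordinary string: every backslash doubled
raw = r"([a-zA-Z0-9]+)='((?:\\.|[^'])*)'"           # raw string: backslashes taken literally
triple = r"""([a-zA-Z0-9]+)='((?:\\.|[^'])*)'"""    # raw triple-quoted: either quote type allowed inside

print(plain == raw == triple)  # True

# A raw triple-quoted literal also lets the input be pasted unchanged.
line = r"""enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''"""
groups = re.findall(raw, line)
print(groups)
```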

Community
Alex Hall
0

First, I assume your example input is missing a backslash to escape the single-quote before Bob.

Second, the expected output you provided is not strictly JSON, as JSON uses double quotes for strings. My solution will give you a standard JSON string.
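As a quick illustration of that difference, the sketch below compares the `repr` of a Python dict with the output of `json.dumps`. The helper `is_valid_json` is hypothetical and exists only for this example.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


data = {"enabled": "false", "value": ""}

as_json = json.dumps(data)
as_repr = repr(data)

print(as_json)                 # {"enabled": "false", "value": ""}
print(as_repr)                 # {'enabled': 'false', 'value': ''}
print(is_valid_json(as_json))  # True
print(is_valid_json(as_repr))  # False
```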

I chose to parse the string properly into memory and then serialize it to JSON, rather than trying to transform it to JSON directly. The regex and unescape parts match the key-value pairs in the input and replace escaped characters in each value, giving an exact string representation of the values. At this point one could build a Python dictionary from these values and dump it to JSON. Unfortunately, Python dicts don't retain insertion order (before Python 3.7), so the output can have an arbitrary order of entries. To keep the order, treat the parsed values as a stream of key-value pairs and use a custom JSON serializer, like this:

import re
import json

ESCAPES = {
  "n": "\n",
  "t": "\t",
  # ...
}
def _escapematch(m):
  x = m.group(1)
  return ESCAPES.get(x, x)

def unescape(literal):
  return re.sub(r"\\(.)", _escapematch, literal)

def parse_pairs(line):
  return (
    (key, unescape(val))
    for key, val in
    # findall returns (key, value) tuples; finditer would yield match objects that cannot be unpacked
    re.findall(r"([a-zA-Z0-9]+)='((?:[^\\']|\\.)*)'", line)
  )

def convert_to_json(line):
  return json.dumps(dict(parse_pairs(line)))

def dumps_json_object(o):
  return "{" + ", ".join(
    json.dumps(k) + ": " + json.dumps(v)
    for k,v in o
  ) + "}" 

def convert_to_json_keep_order(line):
  return dumps_json_object(parse_pairs(line))

line = """
enabled='false' script='var name=\\'Bob\\'\\\\n ' index='0' value=''
"""

print(convert_to_json(line))
# {"value": "", "enabled": "false", "index": "0", "script": "var name='Bob'\\n "}
# Note: on Python versions before 3.7 the key order is arbitrary and can change between runs

print(convert_to_json_keep_order(line))
# {"enabled": "false", "script": "var name='Bob'\\n ", "index": "0", "value": ""}
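As a possible alternative to a hand-written serializer, `collections.OrderedDict` keeps insertion order on any Python version, and `json.dumps` serializes it in that order. This is a minimal sketch, not the method used above: the names `PAIR_RE` and `convert_to_json_ordered` are illustrative, and the unescape step is left out for brevity.

```python
import json
import re
from collections import OrderedDict

# Same key='value' pattern as above; escaped characters stay inside the captured value.
PAIR_RE = re.compile(r"([a-zA-Z0-9]+)='((?:[^\\']|\\.)*)'")


def convert_to_json_ordered(line):
    """Serialize key='value' pairs to JSON, keeping the order they appear in."""
    pairs = PAIR_RE.findall(line)  # list of (key, value) tuples in input order
    return json.dumps(OrderedDict(pairs))


sample = "enabled='false' script='x' index='0' value=''"
result = convert_to_json_ordered(sample)
print(result)
# {"enabled": "false", "script": "x", "index": "0", "value": ""}
```

On Python 3.7 and later a plain `dict` gives the same ordering, so `OrderedDict` mainly matters for older interpreters.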
Tamas Hegedus
  • If you used a raw string literal for `line`, you wouldn't have to double up all those backslashes. – PaulMcG Apr 22 '16 at 07:17
0

Pyparsing is useful here, especially if you get more complex inputs. See comments in source code below:

from pyparsing import *

EQ = Suppress('=')
key = Word(alphas, alphanums)
value = QuotedString("'", escChar="\\")
parser = OneOrMore(Group(key + EQ + value))

# multiplication with an integer or tuple works too
#  parser = 4 * Group(key + EQ + value)
#  ONE_OR_MORE = (1,)
#  parser = ONE_OR_MORE * Group(key + EQ + value)


sample = r"""
    enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''
"""

# parse the sample string
res = parser.parseString(sample)

# pretty-print parsed results
res.pprint()

# convert results to list and make a dict from it
print(dict(res.asList()))


# alternatively, make the parser do the dict-building
parser = Dict(OneOrMore(Group(key + EQ + value)))
res = parser.parseString(sample)

# parsed results look like a list
res.pprint()

# but Dict will define key-values to make a dict-like return object
print(res.dump())
print(res['enabled'])
print(res.keys())

# or access fields using object.attribute notation
print(res.enabled)

prints:

[['enabled', 'false'],
 ['script', "var name='Bob'\\\n "],
 ['index', '0'],
 ['value', '']]

{'index': '0', 'enabled': 'false', 'value': '', 'script': "var name='Bob'\\\n "}

[['enabled', 'false'],
 ['script', "var name='Bob'\\\n "],
 ['index', '0'],
 ['value', '']]

[['enabled', 'false'], ['script', "var name='Bob'\\\n "], ['index', '0'], ['value', '']]
- enabled: false
- index: 0
- script: var name='Bob'\

- value: 

false

['index', 'enabled', 'value', 'script']

false
PaulMcG