
I have an input string like this:

'{ query: { and: [ { and: [ { _t: "Manifest" }, { or: [ { and: [ { _i: { gt: "53b2616fe4b028359ac3fea4" } } ] } ] }, { _s: "active" } ] }, { ENu_v: { elemMatch: { EOJ_v: { in: [ "*", "Production", "QA    " ] } } } } ] }, orderby: { _i: 1 } } '

I want to change it to a dictionary.

a = '{ query: { and: [ { and: [ { _t: "Manifest" }, { or: [ { and: [ { _i: { gt: "53b2616fe4b028359ac3fea4" } } ] } ] }, { _s: "active" } ] }, { ENu_v: { elemMatch: { EOJ_v: { in: [ "*", "Production", "QA    " ] } } } } ] }, orderby: { _i: 1 } } '

json.loads(a)

But this throws an exception, since query should be "query", and should be "and", and so on.

So I want to change every bare word like string to "string". How can I achieve this?

Lev Levitsky
zhihuifan
  • Where does it come from? If you or someone you know created it, I would suggest fixing it there. Otherwise you would have to do your own custom parsing to differentiate between 'identifiers' and other items ({, [, (, :, integers, etc.) – RvdK Jul 02 '14 at 09:58
  • I suggest to look at http://stackoverflow.com/questions/8815586/convert-invalid-json-into-valid-json and http://stackoverflow.com/questions/18280279/parsing-malformed-json-in-javascript for any regex examples. Maybe they will work on your 'json'. – RvdK Jul 02 '14 at 10:01
    Related: [Converting str to dict in python](http://stackoverflow.com/q/24009145), which also repairs JavaScript output to be JSON, plus adds an alternative library to parse this without regular expression tricks. – Martijn Pieters Jul 02 '14 at 10:02

2 Answers


Use re.sub:

In [1]: import re

In [2]: text = '{ query: { and: [ { and: [ { _t: "Manifest" }, { or: [ { and: [ { _i: { gt: "53b2616fe4b028359ac3fea4" } } ] } ] }, { _s: "active" } ] }, { ENu_v: { elemMatch: { EOJ_v: { in: [ "*", "Production", "QA    " ] } } } } ] }, orderby: { _i: 1 } } '

In [3]: re.sub(r'(\w+):', r'"\1":', text)
Out[3]: '{ "query": { "and": [ { "and": [ { "_t": "Manifest" }, { "or": [ { "and": [ { "_i": { "gt": "53b2616fe4b028359ac3fea4" } } ] } ] }, { "_s": "active" } ] }, { "ENu_v": { "elemMatch": { "EOJ_v": { "in": [ "*", "Production", "QA    " ] } } } } ] }, "orderby": { "_i": 1 } } '

Note that you have to use a raw-string literal (or escape \1 as \\1) for the replacement text, otherwise you won't get the expected output.
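A minimal sketch of why the raw string matters, using a shortened input that is only an illustration and not the full string from the question:

```python
import re

# Shortened, hypothetical input used only to illustrate the point.
text = '{ query: { _i: 1 } }'

# Raw replacement: \1 is a backreference to the captured key.
quoted = re.sub(r'(\w+):', r'"\1":', text)

# Non-raw replacement: Python turns '\1' into the single control
# character chr(1) before re.sub ever sees it, so the key is lost.
broken = re.sub(r'(\w+):', '"\1":', text)

print(quoted)
print(repr(broken))
```

In the second call the keys are replaced by a control character instead of being wrapped in quotes.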


I have assumed that your text doesn't contain "strange" things like:

  • colons inside a value (e.g. {a: "some:string"}; the "some:string" isn't preserved by this solution)
  • complex strings that contain nested structure (e.g. {a: "{b : \"hello\"}"})

If these assumptions don't hold, you have to actually parse the text; you cannot safely transform it using regexes alone.

The ast module, together with the third-party codegen module, makes it easy to manipulate such data. For example, you can create a NodeTransformer subclass such as:
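As a middle ground between the simple regex and a real parser, here is a hedged sketch that lets the regex consume complete double-quoted strings and leave them untouched, so colons inside values survive. The names TOKEN and quote_keys are hypothetical, and the sketch assumes strings are always double-quoted with backslash escapes; it is not a full parser.

```python
import json
import re

# Group 1: a complete double-quoted string (with backslash escapes).
# Group 2: a bare word that is followed by a colon, i.e. an unquoted key.
TOKEN = re.compile(r'("(?:[^"\\]|\\.)*")|(\w+)(?=\s*:)')

def quote_keys(text):
    """Quote bare keys, leaving existing string literals as they are."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)          # string literal: keep verbatim
        return '"' + match.group(2) + '"'  # bare key: add quotes
    return TOKEN.sub(repl, text)

fixed = quote_keys('{a: "some:string", b: {c: 1}}')
print(fixed)
print(json.loads(fixed))
```

Because the string alternative is tried first at every position, a match can never start in the middle of a string value.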

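import ast

class QuoteNames(ast.NodeTransformer):
    """Replace bare names found inside dict literals with string constants."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._inside_dict = False

    def visit_Name(self, node):
        if self._inside_dict:
            # ast.Str is deprecated since Python 3.8; ast.Constant replaces it
            return ast.copy_location(ast.Constant(node.id), node)
        return node

    def visit_Dict(self, node):
        # Remember the outer state so a nested dict does not reset the flag
        previous = self._inside_dict
        self._inside_dict = True
        self.generic_visit(node)
        self._inside_dict = previous
        return node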

And use it as:

import ast, codegen
codegen.to_source(QuoteNames().visit(ast.parse(text)))
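If installing codegen is not an option, the same idea can be sketched with the standard library alone, assuming Python 3.9 or later where ast.unparse exists. QuoteDictKeys is a hypothetical, simplified variant that only quotes names in key position, and the input is a made-up sample that avoids Python keywords:

```python
import ast

class QuoteDictKeys(ast.NodeTransformer):
    """Turn bare names used as dict keys into string constants."""
    def visit_Dict(self, node):
        self.generic_visit(node)  # transform nested dicts first
        node.keys = [
            ast.copy_location(ast.Constant(key.id), key)
            if isinstance(key, ast.Name) else key
            for key in node.keys
        ]
        return node

# Hypothetical sample: valid Python syntax, no reserved words as keys.
text = '{query: {status: "active"}, orderby: {_i: 1}}'

tree = QuoteDictKeys().visit(ast.parse(text, mode='eval'))
source = ast.unparse(tree)        # back to source code, keys now quoted
data = ast.literal_eval(source)   # safe evaluation into a real dict

print(source)
print(data)
```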

However, your sample text is not a syntactically valid Python literal: some brackets aren't well matched (which is probably an error in your example), some string values are missing their closing quotes, and you cannot use and or or as identifiers, because they are reserved words.

If you can fix the format to match Python syntax, then the above solution is much more robust than the one using regexes. If that is not possible, you'd have to write your own parser for it, or look for a third-party module that is able to do that.

Bakuriu
  • This will fail on any word that has already ":" in it, like this input '{ query: "bla:a"}'. – Seb D. Jul 02 '14 at 10:00
    I've used `re.sub(r'(?:^|(?<=[{,]))\s*(\w+)(?=:)', r' "\1"', text, flags=re.M)` in the past, looking for commas or opening braces or the start of a line before. – Martijn Pieters Jul 02 '14 at 10:06
  • @SébastienDeprez Yes, but there's no such thing in the OP's example, so I assume that `text` is simple. Obviously, if you want to take everything into account (including things like nested "dicts" inside a value string) you have to actually parse the thing, because a regex won't do, in the same way that a regex won't work to parse HTML. – Bakuriu Jul 02 '14 at 11:18

You can match the following:

'(\w+):'

and replace with:

'"\1":'

where \1 is the first captured group.
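Put together in Python, the substitution followed by json.loads could look like the sketch below. The input is a shortened version of the string from the question, and the caveat about colons inside string values still applies:

```python
import json
import re

# Shortened version of the question's input, for illustration only.
a = '{ query: { and: [ { _t: "Manifest" }, { _s: "active" } ] }, orderby: { _i: 1 } }'

# Wrap every bare word that precedes a colon in double quotes.
valid_json = re.sub(r'(\w+):', r'"\1":', a)

# The result is now valid JSON and can be loaded into a dictionary.
data = json.loads(valid_json)

print(data['orderby'])
print(data['query']['and'][0])
```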

You can see it in action here: DEMO

sshashank124