I understand that there are many similar questions about JSON parsing and special escape characters, but I'm unable to find a solution. I'm trying to save the following to a JSON file that I can later load as a dict using Python's json module. My JSON looks something like this:

{"head":{"body":{"/^\s+|\s+$":"", "\s+":" "}}}

When I put this in a JSON file and load it, I get a parsing error, as expected, since the backslashes are not escaped. So I corrected it as follows (based on suggestions from SO):

{"head":{"body":{"/^\\s+|\\s+$":"", "\\s+":" "}}}

However, when I load it into a dict, it parses, but the dict looks like this:

{"head":{"body":{"/^\\s+|\\s+$":"", "\\s+":" "}}}

and not with a single backslash as expected. How do I deal with this so that my \s has only a single backslash and not two? I also thought of using ast.literal_eval() to read the data, but I don't want to go that way. Any suggestions on how to go about this?

Mr. Confused
  • You can use replace – soheshdoshi Nov 20 '19 at 17:06
  • try `r"/^\s+|\s+$"` instead, r stands for raw – geckos Nov 20 '19 at 17:06
  • r doesn't work in jsons. I tried that. JSONDecodeError: Expecting property name enclosed in double quotes. Everything has to be in double quotes. – Mr. Confused Nov 20 '19 at 17:08
  • It seems to be impossible. See this: https://stackoverflow.com/questions/49763394/impossible-to-store-json-in-python-with-single-un-escaped-backslash – Rahul Raut Nov 20 '19 at 17:28
  • @RahulRaut: It seems that you are correct. Tbh I'm getting irritated trying to get a single backslash. I have tried various versions, like "\u005C", but it seems that getting a single backslash (\) is impossible. Thanks for sharing the link. Any suggestions on how to deal with this? I'm now thinking of saving the file as txt and not as JSON and then using ast. Will update here if that works. – Mr. Confused Nov 20 '19 at 17:53
  • I don't think saving the file as txt will make much difference, but please try it and post an update. – Rahul Raut Nov 20 '19 at 17:57
  • @RahulRaut: Yes, tried it. You are correct, it doesn't make any difference. Saving in txt format and then using ast.literal_eval() also gives two backslashes instead of one (\\s instead of \s). This is just so frustrating. Any other suggestion, or do I have to go with a dict replace? – Mr. Confused Nov 20 '19 at 18:01
  • yes i think. dict replace is the only option. – Rahul Raut Nov 20 '19 at 18:07
  • @Tomalak: I have done nothing special in Python. I'm just loading the file and then printing it, like this: `with open(json_file_path, 'r') as j: contents = json.loads(j.read())` followed by `print(contents.get('head').get('body'))`. If you have a solution to the above, your help is appreciated. – Mr. Confused Nov 20 '19 at 18:17
  • That looks reasonable (except that `json.loads(j.read())` is better written as `json.load(j)`). But every comment you received above is plain wrong. You cannot use replace, you cannot use regex, and it's not impossible, either. Your mistake is that you did not create the JSON file properly (i.e. you tried to write it by hand without knowing exactly how JSON works). Don't do that. – Tomalak Nov 20 '19 at 18:41

1 Answer

You have a data structure with a few regular expressions. In Python syntax this would be:

data = {
    'head': {
        'body': {
            r'^\s+|\s+$': '',
            r'\s+': ' '
        }
    }
}

When you convert this data to JSON and store it in a file:

import json

with open('test.json', 'w', encoding='utf8') as fp:
    json.dump(data, fp)

and open the resulting file in a text editor, you will see:

{"head": {"body": {"^\\s+|\\s+$": "", "\\s+": " "}}}

When you parse this JSON file again:

with open('test.json', encoding='utf8') as fp:
    data = json.load(fp)

print(data)

Python will print this:

{'head': {'body': {'^\\s+|\\s+$': '', '\\s+': ' '}}}

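...which is precisely the same thing we had in the first place. The only difference is notation: initially we wrote raw string literals r'...', whereas print() shows the strings inside a dict in regular string literal notation, where each backslash is written as \\. Python's print() will never output the raw format.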
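A minimal sketch of that round trip, done in memory with `json.dumps` and `json.loads` instead of a file (the variable names are only illustrative). It compares the literal notation of a key with the characters the string actually contains:

```python
import json

data = {'head': {'body': {r'^\s+|\s+$': '', r'\s+': ' '}}}

# Serialize and parse again; a file round trip behaves the same way.
text = json.dumps(data)
restored = json.loads(text)

# The parsed structure is equal to the original one.
same = restored == data

# Pick the second key and look at it in two ways.
key = list(restored['head']['body'])[1]
print(repr(key))   # Python literal notation, backslash shown escaped
print(key)         # the actual characters of the string
print(len(key))    # 3 characters: backslash, 's', '+'
```

The doubled backslash only exists in the JSON text and in the literal notation. The string itself holds a single backslash.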

The thing you wanted initially in your JSON file:

{"head":{"body":{"/^\s+|\s+$":"", "\s+":" "}}}

is not JSON, because \s is not a valid escape sequence inside a JSON string, and there is no reason whatsoever to try and achieve this format.
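To see a parser reject this text, here is a small sketch that feeds it to `json.loads`. The `not_json` and `parsed_ok` names are made up for illustration, and a raw string literal is used so that the single backslashes reach the parser unchanged:

```python
import json

# Raw string literal: the parser receives the text exactly as typed,
# with single backslashes.
not_json = r'{"head":{"body":{"/^\s+|\s+$":"", "\s+":" "}}}'

try:
    json.loads(not_json)
    parsed_ok = True
except json.JSONDecodeError as err:
    # \s is not one of the escape sequences JSON allows inside strings.
    parsed_ok = False
    print('rejected:', err.msg)
```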

Conclusion

  • JSON is a string (JSON is never anything but a string; in particular, it is never an "object" or an "array").
  • JSON strings must be parsed. Do not use a JSON string for anything other than feeding it to a JSON parser (or storing it in a file or database, or sending it over the network).
  • In particular, never use string operations like replace or regex on JSON strings, as this will easily break them.
  • Use a JSON library to convert data structures to JSON and back. Avoid "winging it" and writing JSON by hand, especially when the data contains complex structures like regular expressions and you're not 100% certain of the JSON syntax rules.
  • There is no reason to ever worry about the number of backslashes in the JSON, because the parser restores the original strings when it reads them.
  • The samples above use Python, but the same approach applies to any other programming language.
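As a sketch of how the loaded patterns might then be used, the following applies each key as a regex and each value as its replacement. The `normalize` helper and the sample text are hypothetical and not part of the original question:

```python
import json
import re

# The JSON text as it appears in the file written by json.dump().
json_text = r'{"head": {"body": {"^\\s+|\\s+$": "", "\\s+": " "}}}'

rules = json.loads(json_text)['head']['body']

def normalize(text, rules):
    """Apply every pattern/replacement pair in order (hypothetical helper)."""
    for pattern, replacement in rules.items():
        text = re.sub(pattern, replacement, text)
    return text

result = normalize('   hello    world  ', rules)
print(repr(result))
```

The patterns work as regular expressions straight after parsing; no backslashes have to be removed or replaced first.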
Tomalak
  • Thanks for the very informed answer. Just a query, as I haven't tried it yet: you say that "There is no reason to ever worry about the number of backslashes in the JSON", and you rightly said that I wanted to save these regex patterns and others like them. I intend to use them in a function where the keys and values are used in a regex-replace kind of functionality. I understand that getting a single backslash in the file is impossible since it's not valid JSON, but wouldn't a double backslash change the regex pattern? So what's the way out? Any suggestions? – Mr. Confused Nov 20 '19 at 18:39
  • I don't know what "way out" you mean. Out of what? The input and the output of the process above are **precisely** the same thing. Try `for regex in data['head']['body']: print(regex)` before and after. – Tomalak Nov 20 '19 at 18:45
  • Understood. Will check it out. And thanks for the help in explaining how json works. – Mr. Confused Nov 20 '19 at 18:47
  • JSON stores \\ to represent \. When you parse it, \\ becomes \ again. Your single backslash never went away, there is nothing you need to do to "keep" or "restore" it. Stop counting how many backslashes there are in the JSON, that's not your concern, that's the JSON parser's concern. – Tomalak Nov 20 '19 at 18:51
  • But then isn't what you are saying in the comment above contradictory to what you said in the answer? In the answer your regex pattern had \s (one backslash), then you dumped it using json.dump and got a file containing \\s (two backslashes, so far so good). Then you read the dumped JSON file using json.load, assigned it to a variable named data and printed it, which gives a dict with \\s (two backslashes, and not one as the parser should have produced according to your comment above). Thus on parsing, \\s doesn't go back to \s, which is the problem I was facing initially. – Mr. Confused Nov 20 '19 at 19:04
  • No, that's not contradictory. In my answer I'm using raw string literals initially (see update, I've pasted a link that explains them). Python string literals **also** use \\ to represent \, just like JSON does it. When the Python source code gets read, \\ becomes \, just exactly like JSON does it. When you `print()` a string containing a single \ to the console, Python converts it to \\ again. – Tomalak Nov 20 '19 at 19:09
  • Strictly speaking, `'\s'` is not clean Python, because "backslash-s" is not a recognized escape sequence in string literals. "backslash-n" means "newline", so `'\n'` is valid and contains a single newline character. To get an *actual* "backslash-s" into a Python string, you really need to write `'\\s'`. The Python parser tries to be helpful and keeps the backslash when it sees `'\s'` (recent versions emit a warning for it), which is why it *seems* as if your regex definition would be okay as `'\s'`, but from a technical perspective it's not clean. – Tomalak Nov 20 '19 at 19:14
  • Since meticulously escaping all the backslashes quickly becomes tedious when writing e.g. regex in Python, you can use raw string literals. `r'\s'` is the same thing as `'\\s'`. The downside is that you cannot easily insert special characters like newlines anymore: compare `len('\n')` and `len(r'\n')`. – Tomalak Nov 20 '19 at 19:18
  • Yes, I should have noticed it beforehand. I just ran the whole code (\\s indeed works for what I intended to do). I'm guessing that since Python was silently helping me by changing \s to \\s, I never noticed earlier and wasted a whole lot of time today. Using the clean representations would be better from now on. – Mr. Confused Nov 20 '19 at 19:23
  • Knowing when things (need to) get escaped one way or the other never comes easy. Regex is the next level of escaping here. Technically, regex also uses \\ to mean \. So in a language that does not have raw string literals, if you want to write a regex that matches a single backslash, you would end up with `regex = "\\\\"`, with two layers of escaping going on. – Tomalak Nov 20 '19 at 19:28
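The layers of escaping described in these comments can be checked with a few lines of Python. This is only an illustrative sketch, and the variable names are made up:

```python
import re

# A raw literal and an escaped literal produce the same string.
raw_equals_escaped = r'\s' == '\\s'

# '\n' is one newline character; r'\n' is a backslash followed by 'n'.
newline_len = len('\n')
raw_len = len(r'\n')

# A regex that matches one literal backslash needs the backslash escaped
# for the regex engine, and once more in a non-raw string literal.
pattern_plain = "\\\\"
pattern_raw = r"\\"
same_pattern = pattern_plain == pattern_raw

# 'a\\b' is the three characters a, backslash, b.
match = re.search(pattern_raw, 'a\\b')
print(raw_equals_escaped, newline_len, raw_len, same_pattern, match.group())
```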