1

I have about 500 JSON files with comments in them. Trying to update a field in one of these files with a new value throws an error. I managed to use commentjson to handle comments like this // some text, and the JSON file then updates without errors.

But there are about 100 JSON files with comments like this:

  /*

   1. sometext.
        i. sometext
        ii. sometext 
   2. sometext

  */

commentjson just crashes when a /* is present. If I remove the /* ... */ block by hand and run the code, it works: the file is updated and any // comments are removed. How can I write code to handle /* and all the text between /* and */?

This is my current code, which can remove // comments:

import commentjson

# Read the source file; commentjson drops // comments while parsing
with open(f"{i['Location']}\\{file_name}", 'r') as f:
    json_info = commentjson.load(f)

json_info['password'] = password

# Write the updated data (now without comments) to the daily location
with open(f"{i['location_Daily']}\\{file_name}", 'w') as f:
    commentjson.dump(json_info, f, indent=4)
print("updated")
Martijn Pieters
Lewis Green
  • Why do you want to remove them? – Olvin Roght Sep 03 '21 at 11:08
  • @OlvinRoght Comments are not even valid JSON, so most JSON parsers will blow up trying to read these files (and whoever created them should get a stern talking to ;) – Iguananaut Sep 03 '21 at 11:09
  • @Iguananaut, there are JSON parsers that support the JSON5 standard. – Olvin Roght Sep 03 '21 at 11:10
  • @Iguananaut: The `commentjson` package does support comments: *commentjson (Comment JSON) is a Python package that helps you create JSON files with Python and JavaScript style inline comments*. Just not this style. – Martijn Pieters Sep 03 '21 at 11:11
  • I need to remove these because when I try to update a value in a file that contains comments, my code throws an error and won't update. So far I've managed to handle the // comments, but the /* is causing issues now. – Lewis Green Sep 03 '21 at 11:11
  • *Just not this style* Right, so it's a good reason to want to remove them (and in general so that standard JSON parsers can read the files). – Iguananaut Sep 03 '21 at 11:12

2 Answers

5

You can use another library, such as json5 or pyjson5, or anything else that supports JSON5:

import json5
import pyjson5

data = '''
{
    "something": [
        ["any"],
        ["thing", "here", 10]    // This is comment 1
    ],
    /* While this
    is
    comment 2 */
    "car": [
        ["and", "another", "here"], /* Last comment */
    ]
}
'''

print(json5.loads(data))
print(pyjson5.loads(data))

Output

$ python3 script.py 
{'something': [['any'], ['thing', 'here', 10]], 'car': [['and', 'another', 'here']]}
{'something': [['any'], ['thing', 'here', 10]], 'car': [['and', 'another', 'here']]}
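
To connect this to the update step in the question, here is a minimal sketch in which the parser is passed in as a parameter. The function name `update_field` and its arguments are hypothetical, not part of either library. It defaults to the standard library's `json.loads` so that it runs without extra packages, and the assumption is that you would pass `json5.loads` or `pyjson5.loads` for files that contain comments.

```python
import json


def update_field(src_path, dst_path, key, value, loads=json.loads):
    """Read src_path with `loads`, set data[key] = value, write dst_path.

    `loads` is any callable that turns text into Python objects; pass a
    comment-tolerant one (for example json5.loads) for files with comments.
    """
    with open(src_path, "r", encoding="utf-8") as f:
        data = loads(f.read())
    data[key] = value
    # The output is plain JSON: comments from the source are not preserved.
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return data
```

With one of the libraries above installed, a call would look like `update_field(src, dst, "password", password, loads=pyjson5.loads)`. Writing with the standard `json.dump` means the updated files can be read by any JSON parser afterwards.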
  • It's important to notice that `pyjson5` is **much faster** than `json5` and significantly faster than pure python `json`. Check: [Performance](https://pyjson5.readthedocs.io/en/latest/performance.html) section of `pyjson5` docs; *Known Issues* section in [`json5`](https://pypi.org/project/json5/) project description. – Olvin Roght Sep 03 '21 at 11:24
1

You have a few options:

  • Read the whole file into a string, then use a regular expression to pre-process the text. E.g.:

    import re
    import commentjson

    with open(...) as f:
        json_text = f.read()
    # remove everything from '/*' to '*/', where each character in between is either
    # - a '*' character that is *not* followed by '/'
    # - any character that is not '*'
    without_comments = re.sub(r"/\*(?:\*(?!/)|[^*])*\*/", "", json_text)
    json_info = commentjson.loads(without_comments)
    

    Note that this approach is not going to work if there are also JSON strings with the /* and */ inside of them. A regex is not a JSON parser.

  • Update the parser that the commentjson project uses to parse JSON. Looking at the project source code, they use the Lark parsing library, so you could monkey-patch the module with additional grammar.

    I note that the main branch already has a grammar rule defining multi-line comments:

    COMMENT: "/*" /(.|\\n)+?/ "*/"
           | /(#|\\/\\/)[^\\n]*/
    

    but that is not yet part of their release. You can, however, re-use that rule:

    from commentjson import commentjson as implementation
    
    serialized = implementation.parser.serialize()
    for tok in serialized["parser"]["lexer_conf"]["tokens"]:
        if tok["name"] != "COMMENT":
            continue
        if tok["pattern"]["value"].startswith("(#|"):
            # only supports `#` or `//` comments, add block comments
            tok["pattern"]["value"] = r'(?:/\*(?:\*(?!/)|[^*])*\*/|(#|\/\/)[^\n]*)'
        break
    
    implementation.parser = implementation.parser.deserialize(serialized, None, None)
    

    I used my own regex in that grammar update rather than the version used by the project.

  • Find a different library to parse the input. There are several options that claim to support parsing JSON with the same syntax.

    I have not tried any of these nor have anything to say about their usability or performance.
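
As a variation on the first option (regex pre-processing), here is a minimal standard-library sketch that avoids the problem of /* and */ appearing inside JSON strings. It matches string literals first and puts them back unchanged, so only real comments are dropped. The names `strip_comments` and `_TOKEN` are made up for this example, and the sketch assumes the files use only // and /* */ comments (no # comments, no trailing commas).

```python
import json
import re

# Alternatives are tried left to right at each position: a double-quoted
# JSON string (kept), a /* ... */ block comment (dropped), or a // line
# comment (dropped).
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'   # JSON string, honouring backslash escapes
    r'|/\*.*?\*/'          # block comment, non-greedy, may span lines
    r'|//[^\n]*',          # line comment up to the end of the line
    re.DOTALL,
)


def strip_comments(text):
    """Return text with // and /* */ comments removed, strings untouched."""
    def keep_strings(match):
        token = match.group(0)
        return token if token.startswith('"') else ""
    return _TOKEN.sub(keep_strings, text)


sample = '''
{
    /* multi-line
       comment */
    "url": "http://example.com/*not-a-comment*/", // trailing comment
    "password": "old"
}
'''

data = json.loads(strip_comments(sample))
data["password"] = "new"
print(json.dumps(data, indent=4))
```

Because a string is consumed as a single token, the // and /* ... */ inside the "url" value are never seen as comment starts. The cleaned text is plain JSON, so the standard `json` module can load it.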

Martijn Pieters