1

Suppose you have a JSON file which includes C-style comments:

{
  "foo": {
    "default_level": "debug",
    // A comment
    "impl": "xyz"
  },
  "bar": [
    {
      /*This is a comment*/
      "format": "%l%d %c ….",
      "rotation": "daily, 1_000_000",
    }
  ]
}

Before this JSON is deserialized, what would be the easiest way to strip these comments off using Java? Let's assume that only single-line // and multi-line /* */ comments are supported.

Ultimately, I'd like to read in a String representation of the same file but without comments:

{
  "foo": {
    "default_level": "debug",
    "impl": "xyz"
  },
  "bar": [
    {
      "format": "%l%d %c ….",
      "rotation": "daily, 1_000_000",
    }
  ]
}
djechlin
James Raitsev

3 Answers

1

You will probably have better luck processing this as JavaScript, since JSON is nearly a subset of JavaScript, and JSON plus C-style comments is likewise nearly a subset of JavaScript. Try:

Looking to remove comments from a large amount of javascript files

Basically, just run it through your favorite minifier first. Note that JSON is not a strict subset of JavaScript, so you will need to turn your nearly-legal JSON into legal JavaScript before you can trust a minifier. Fortunately, this is solvable by a simple find-and-replace.
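To make the find-and-replace step concrete, here is a minimal Java sketch. It rests on two assumptions that are mine, not the answer's: that the incompatibility in question is the raw U+2028 and U+2029 characters (legal inside JSON strings, but not inside JavaScript string literals before ES2019), and that the minifier needs the object wrapped in a statement, since a bare `{ "foo": ... }` at the start of a script is parsed as a block. The class name, method names, and the `data` variable are all hypothetical.

```java
public final class JsonToJs {

    private JsonToJs() {}

    /** Replaces raw U+2028 and U+2029 with their escaped forms (assumed incompatibility). */
    public static String escapeLineSeparators(String json) {
        return json
                .replace("\u2028", "\\u2028")
                .replace("\u2029", "\\u2029");
    }

    /** Wraps the text in an assignment so a JavaScript tool sees an expression, not a block. */
    public static String toJsSource(String json) {
        // "data" is a placeholder variable name chosen for this sketch.
        return "var data = " + escapeLineSeparators(json) + ";";
    }

    public static void main(String[] args) {
        String json = "{ \"foo\": \"bar\" /* a comment */ }";
        System.out.println(toJsSource(json));
    }
}
```

After minifying, you would still need to strip the `var data = ` prefix and the trailing semicolon again before handing the text to a JSON parser.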

djechlin
0

This is actually a non-trivial problem. I would personally suggest the Comment-Stripper library, which in my opinion does a pretty good job of this. It can be found here: https://github.com/Slater-Victoroff/CommentStripper?source=cc

A more fully featured and debugged version was forked a while ago, but hopefully this should solve the issue.

Full-disclosure: I wrote this library after asking a similar question and realizing there weren't any great solutions I could find.

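Alternatively, if you just want to remove comments, you can round-trip the file through a parser in Python, which you can call from Jython. Be aware that Python's standard json module rejects comments, so this only works with a parser that tolerates them.

import json

def clean(path):
    # Caveat: the standard json module raises an error on // and /* */ comments
    with open(path) as f:
        return json.dumps(json.load(f))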

If you're dead set on native Java, you can do basically the same thing using GSON (http://code.google.com/p/google-gson/). I assume it's also possible with Jackson (http://jackson.codehaus.org/), though I would suggest the lighter GSON for something this simple.

GSON example:

// Requires com.google.gson.Gson, com.google.gson.JsonElement and java.io.*
Gson gson = new Gson();
// Gson.fromJson parses leniently, so // and /* */ comments are skipped
BufferedReader br = new BufferedReader(new FileReader("file.json"));
String clean = gson.toJson(gson.fromJson(br, JsonElement.class));

This example assumes some supporting code around it and only covers the use of GSON. The rest should be pretty trivial (parsing into the generic JsonElement type avoids writing a class of your own); check out the GSON docs if you're really having trouble.

https://sites.google.com/site/gson/gson-user-guide

Slater Victoroff
-4

Try this regex.

String jsonData =
"{\n"+
"  \"foo\": {\n"+
"    \"default_level\": \"debug\",\n"+
"    // A comment\n"+
"    \"impl\": \"xyz\"\n"+
"  },\n"+
"  \"bar\": [\n"+
"    {\n"+
"      /*This is a comment*/\n"+
"      \"format\": \"%l%d %c ….\",\n"+
"      /* This is a\n"+
"         multi-line comment */\n"+
"      \"rotation\": \"daily, 1_000_000\",\n"+
"    }\n"+
"  ]\n"+
"}";

System.out.println(
       jsonData.replaceAll("//.*\\n\\s*|/\\*.*?\\n?.*?\\*/\\n?\\s*", "")
);

Output:

{
  "foo": {
    "default_level": "debug",
    "impl": "xyz"
  },
  "bar": [
    {
      "format": "%l%d %c ….",
      "rotation": "daily, 1_000_000",
    }
  ]
}

Note: This won't work if your JSON could contain comment characters as data, such as:

 "comment":"/* this is data */", "impl": "abc//xyz"
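If your data can contain such strings, one way around the limitation is a small hand-written scanner that tracks whether it is currently inside a string literal. The following is only a sketch under my own assumptions: the class and method names are made up, it handles just `//` and `/* */` comments, and it does not validate the JSON or clean up the whitespace left behind where a comment used to be.

```java
public final class JsonCommentStripper {

    private JsonCommentStripper() {}

    /** Removes line and block comments that occur outside JSON string literals. */
    public static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int n = input.length();
        int i = 0;
        while (i < n) {
            char c = input.charAt(i);
            if (c == '"') {
                // Copy the whole string literal verbatim, honoring backslash escapes.
                out.append(c);
                i++;
                while (i < n) {
                    char s = input.charAt(i);
                    out.append(s);
                    i++;
                    if (s == '\\' && i < n) {
                        out.append(input.charAt(i)); // escaped character, e.g. an escaped quote
                        i++;
                    } else if (s == '"') {
                        break; // closing quote
                    }
                }
            } else if (c == '/' && i + 1 < n && input.charAt(i + 1) == '/') {
                // Line comment: skip up to, but not including, the line break.
                i += 2;
                while (i < n && input.charAt(i) != '\n' && input.charAt(i) != '\r') {
                    i++;
                }
            } else if (c == '/' && i + 1 < n && input.charAt(i + 1) == '*') {
                // Block comment: skip through the closing delimiter, or to end of input.
                int end = input.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String json = "{ \"impl\": \"abc//xyz\", /* gone */ \"c\": \"/* data */\" } // tail";
        System.out.println(strip(json));
    }
}
```

Because string literals are copied through untouched, values like `"abc//xyz"` survive, while comments outside strings are dropped. The line break after a `//` comment is kept, so line numbers in later parser errors still match the original file.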
Ravi K Thapliyal
  • This will also remove json content like `"impl": "abc//xyz"` – jlordo Jun 27 '13 at 18:41
  • 4
    Now you have two problems. – djechlin Jun 27 '13 at 19:13
  • @djechlin could you elaborate some more please? – Ravi K Thapliyal Jun 27 '13 at 19:18
  • http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html – djechlin Jun 27 '13 at 19:18
  • 1
    Regex is not the solution, -1 – tomdemuyt Jun 27 '13 at 19:19
  • @djechlin I think it's kinda unfair to down vote a *working* solution based on difference of *philosophy*. – Ravi K Thapliyal Jun 27 '13 at 19:28
  • Your solution is incorrect. The downvote is because your solution is incorrect. You're going to edit it again, then it's still going to be incorrect. Then you're going to edit it one more time, and it will still be incorrect. Then you'll edit it one more time, at which point it will be correct and inscrutable. – djechlin Jun 27 '13 at 19:29
  • @RaviThapliyal Escaping comments is not trivial, regex is meant for trivial problems. – tomdemuyt Jun 27 '13 at 19:38
  • @tomdemuyt Yes, I do understand the train of thought going here. It's like validating email or a piece of html. No regex would match almost all of them but people still use them if its *good enough* for their data set. I mean no need to get fancy with the words: "I have two problems". I was just trying to help. – Ravi K Thapliyal Jun 27 '13 at 19:50
  • @RaviThapliyal Entirely different from using regex for email in which case there is actually a defined regex that will match every email, it's just long and gross. Regex is fundamentally unable to match 100% of higher order grammars like html or comments because these grammars are not context free. http://en.wikipedia.org/wiki/Pumping_lemma_for_context-free_languages for learning exactly what that means. – Slater Victoroff Jul 01 '13 at 20:35
  • @SlaterTyranus Yes, I mentioned HTML in my comments and I've seen applications use a much less complicated regex for email. So, although it won't match every email address on this planet it was good-enough for the use-case. And, at the time of posting this solution OP had shared a sample input to filter and had not mentioned that data could contain comments within. I do understand the concerns other have shared in comments but I do not think they warranted a down vote because the solution as such was not wrong or incorrect. – Ravi K Thapliyal Jul 01 '13 at 21:17
  • @SlaterTyranus How ironic is that your *CommentStripper* library makes use of regex only. Just goes to prove that we usually want a solution that works for our *data set* rather than aiming for a fool-proof 100%. I had chosen the same approach but unfortunately, others disliked it. – Ravi K Thapliyal Jul 01 '13 at 21:47
  • @RaviThapliyal You're looking at an out-of-date fork, also it doesn't just use regex, it uses regex on grammatical transformations of the original form which reduce it to a context-free grammar by ensuring that it passes the pumping lemma. That said, it got taken over by someone else a while ago. – Slater Victoroff Jul 01 '13 at 21:59