
I have a large file in the format shown below. I first tried parsing it in JavaScript with JSONStream and ran into some issues. I've been trying to figure out a way to rename the duplicate keys so that I can parse the file more easily. For example, instead of `content`, each key would have a counter appended: `content-1`, `content-2`. The file is far too large to edit manually, so any suggestions on how to do this, or how to restructure the file with JS, would be greatly appreciated!

{
    "Test": {
        "id": 3454534344334554345434,
        "details": {
            "text": "78679786787"
        },
        "content": {
            "text": 567566767656776
        },
        "content": {
            "text": 567566767656776
        },
        "content": {
            "text": 567566767656776
        }
    }
}
BluLotus
  • You don't accept "write your own parser" (as in the comments on your other post), so I guess you are asking for a tool or something like that (as you also did in those comments). That's out of scope for StackOverflow. It seems that the only answer here is "write a parser". – Jorge Fuentes González Oct 04 '19 at 00:40
  • @JorgeFuentesGonzález yes I understand that is a solution but I'm unable to do that in regards to this specific task so I'm curious if anyone has accomplished the above before. – BluLotus Oct 04 '19 at 00:42
  • @JorgeFuentesGonzález Writing a custom parser isn't really necessary here. The parser he's using is fine. I'd write an answer myself, but I don't really get what the problem is with the data he's getting from JSONStream in the first place. The data seems fine to me. – Brad Oct 04 '19 at 00:43
  • Oh, I see. What I can think of, to make it easy, is to search for strings between double quotes and followed by `:`, so you get all the keys. Then you check for duplicates and rename them. This will find duplicates all around the JSON actually, but whatever. – Jorge Fuentes González Oct 04 '19 at 00:44
  • @Brad I'm unable to get the duplicate key data. If I can make the keys non-duplicates, it won't be an issue to use JSONStream. With JSONStream I only get one of the `content` values – BluLotus Oct 04 '19 at 00:45
  • @Brad Hm, if JSONStream returns data on each key/value pair, then it could work. I thought it didn't do that, since the OP had problems with it (I've never used it). I'm also interested in your reply xD – Jorge Fuentes González Oct 04 '19 at 00:46
  • @BluLotus, can you use a language other than JavaScript, such as AWK, just to preprocess the file? With AWK this is just one line. – Alejandro Teixeira Muñoz Oct 04 '19 at 00:47
  • @BluLotus Oh, I think I understand now... the data you're showing is what you're putting in, not what you're getting out? I thought from your original question that the streaming parser was handling your duplicate key names just fine. – Brad Oct 04 '19 at 00:47
  • @Brad yea thats what's going in – BluLotus Oct 04 '19 at 00:49
  • Looking at the [JSONStream source code](https://github.com/dominictarr/JSONStream/blob/master/index.js), it should be pretty easy to get around that and receive duplicates (it's only 250 lines of code). It's going to take a bit of trial and error, but debugging step by step should make it easy to achieve. TIP: There's a `setHeaderFooter` which seems to create the object that is later streamed. Simply fiddle around with that, `hasOwnProperty` and such. – Jorge Fuentes González Oct 04 '19 at 00:51
  • Check out this answer: https://stackoverflow.com/a/28641538/362536 Try switching to Oboe. – Brad Oct 04 '19 at 00:52
  • @Brad thanks, I'm going to check this out – BluLotus Oct 04 '19 at 01:08
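
The idea from the comments above (find every quoted string followed by `:`, check for duplicates, rename them) can be sketched in plain JavaScript. This is only a rough sketch, not something taken from JSONStream or Oboe: `scanKeys` and `renameDuplicateKeys` are made-up names, it assumes the whole file fits in memory as one string, and it tracks `{`/`}` nesting so that only keys repeated inside the same object are renamed.

```javascript
// Matches, in order: a string token (optionally followed by a colon, which
// marks it as an object key), or a brace that opens or closes an object.
const TOKEN = /"((?:[^"\\]|\\.)*)"(\s*:)?|[{}]/g;

// Walks the text token by token and calls onKey(objectId, key, colon, match)
// for every object key; whatever onKey returns replaces the matched key.
function scanKeys(text, onKey) {
  const stack = []; // ids of the objects we are currently inside
  let nextId = 0;
  return text.replace(TOKEN, (match, key, colon) => {
    if (match === "{") { stack.push(nextId++); return match; }
    if (match === "}") { stack.pop(); return match; }
    if (colon === undefined) return match; // a string value, not a key
    return onKey(stack[stack.length - 1], key, colon, match);
  });
}

function renameDuplicateKeys(text) {
  // Pass 1: count how often each key occurs within each object.
  const totals = new Map();
  scanKeys(text, (id, key, colon, match) => {
    const slot = `${id}\u0000${key}`;
    totals.set(slot, (totals.get(slot) || 0) + 1);
    return match;
  });
  // Pass 2: append -1, -2, ... to keys that occur more than once.
  const seen = new Map();
  return scanKeys(text, (id, key, colon, match) => {
    const slot = `${id}\u0000${key}`;
    if (totals.get(slot) < 2) return match;
    const n = (seen.get(slot) || 0) + 1;
    seen.set(slot, n);
    return `"${key}-${n}"${colon}`;
  });
}
```

Calling `renameDuplicateKeys(fs.readFileSync("file.json", "utf8"))` would give text where the three `content` keys become `content-1`, `content-2`, `content-3`, while the `text` keys are left alone because each one lives in a different object. The result can then be fed to a normal parser.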

1 Answer

I know this is a JavaScript question, but as I understand it, you just need to get this file loaded, and it is a big file.

If you are able to use a language other than JavaScript to preprocess the file, you can use AWK for this. AWK runs on Linux, in Bash on Windows, etc.

Here is the code. It numbers each line that contains `"content"` and then prints every line exactly once, using a single unconditional `{ print }` rule:

awk '/"content"/ { a++; sub(/"content"/, "\"content-" a "\"") } { print }' file.json

Output:

    {
        "Test": {
            "id": 3454534344334554345434,
            "details": {
                "text": "78679786787"
            },
            "content-1": {
                "text": 567566767656776
            },
            "content-2": {
                "text": 567566767656776
            },
            "content-3": {
                "text": 567566767656776
            }
        }
    }
  • Hey, nice point. If you still want JavaScript you can run that and then parse the output. Or even better, use this pure JavaScript AWK library: https://github.com/agordon/webawk If that's the way for AWK to work, I'm sure the JavaScript implementation will do it that way also. EDIT: Oh, the problem is that you are looking exactly for "content". Will not detect duplicates automatically. The same thing can be achieved with an easy regex then. – Jorge Fuentes González Oct 04 '19 at 00:53
  • Nice point. I didn't know about this lib!! It seems to be a good option for these formatting mind-puzzles... – Alejandro Teixeira Muñoz Oct 04 '19 at 00:55
  • @JorgeFuentesGonzález I've tried looking for a regex I could use in TextWrangler but I haven't found anything that relates to this. – BluLotus Oct 04 '19 at 01:05
  • You need a regex that looks for the word `"content"`, gets the matches and replaces each of them with an incrementing counter. – Jorge Fuentes González Oct 04 '19 at 10:47
  • In awk this is done with the command `a++;gsub("content","content-"a,$0)`, which runs for every line that matches the pattern (this is a regex) `/content/` https://www.math.utah.edu/docs/info/gawk_5.html – Alejandro Teixeira Muñoz Oct 04 '19 at 10:56
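
For anyone who wants the same regex-with-a-counter approach without leaving Node, here is a minimal sketch that streams the file line by line, so the whole file never has to sit in memory. The names `makeKeyNumberer` and `renameKeyInFile` are made up for this example; like the awk one-liner it only handles one key name that you pass in, and the counter runs across the whole file rather than restarting per object.

```javascript
const fs = require("fs");
const readline = require("readline");
const { once } = require("events");

// Returns a function that rewrites one line at a time, numbering every
// occurrence of the quoted key "<key>" that is followed by a colon.
function makeKeyNumberer(key) {
  let count = 0;
  // Escape regex metacharacters so the key is matched literally.
  const escaped = key.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`"${escaped}"(\\s*:)`, "g");
  return (line) =>
    line.replace(pattern, (_, colon) => `"${key}-${++count}"${colon}`);
}

// Streams inputPath line by line and writes the renamed lines to outputPath.
async function renameKeyInFile(inputPath, outputPath, key) {
  const numberLine = makeKeyNumberer(key);
  const lines = readline.createInterface({
    input: fs.createReadStream(inputPath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  const out = fs.createWriteStream(outputPath);
  for await (const line of lines) {
    // Respect backpressure: wait for "drain" when the write buffer is full.
    if (!out.write(numberLine(line) + "\n")) {
      await once(out, "drain");
    }
  }
  out.end();
  await once(out, "finish");
}
```

Usage would be something like `renameKeyInFile("file.json", "renamed.json", "content")`, after which `renamed.json` can be parsed with JSONStream as usual.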