Data in your current format can be described as interleaved: there are multiple distinct types of record, but their only relation to each other is their position in the file. This structure is an interesting challenge to represent in JSON, which is an object notation, also referred to as a graph of hierarchical data.
When converting this data to JSON we should consider what value the structure will add for the final consumer. There is always some value in making the data easier to parse and consume: even if you are not selling the data feed, you can reduce costs if you have to answer fewer questions about the data later.
The two record types in the set are:
- Index
- Text Entry
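For reference, a pair of interleaved records, reconstructed here from the flattened output shown further down (the exact shape of the original index record is my assumption), looks like this:

{"index":{"_index":"tolstoy","_id":27}}
{"type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"}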
Because there is a 1:1 relationship between these records, there is no value in nesting one of them within the other in the JSON output. There are also no shared/common fields between the two records, nor is there a linking field between them; each record therefore depends entirely on its position next to the other.
From the consumer's point of view, the index is meaningless on its own; indexes are usually designed to facilitate searching for, or referencing, a record quickly.
As the task here is to process the data line by line, and we have identified that there is little value in producing a nested structure and that there are no common fields, this interleaved dataset can easily be merged, or flattened, into a simple array of single records.
After analysing the data we might be able to identify ways to normalise it into object graphs, but that would require re-processing the entire file, and the overall benefit to the consuming user would be questionable; flat data is generally easier to consume with a wider array of tools than a nested structure is.
To flatten this data, we take the fields from the index record and inject them into the text entry. In this example I have also flattened the index structure itself, which was a single nested object:
[
{"index":"tolstoy","_id":27, "type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"},
{"index":"tolstoy","_id":28, "type":"line","line_id":101,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"I will call Pyotr."}
]
You can see this rendered as a table at json2table.com.

The steps to achieve this:
1. Prefix the file with a new line that starts the array: [
2. Remove {"_index": from each index record; this is safe to do globally with a simple find and replace.
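Assuming the index record shape sketched earlier, step 2 turns
{"index":{"_index":"tolstoy","_id":27}}
into
{"index":"tolstoy","_id":27}}
leaving a temporarily unbalanced closing brace, which the next step consumes.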
3. Join the index and text entry records by replacing the following, including the line break, with a comma and a space:
}}
{
If you are using Visual Studio, this can be achieved with a single find and replace using a regular expression, applied after the previous transformations.
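As an illustration (this pattern is my reconstruction, not necessarily the exact expression used): with the Use Regular Expressions option enabled, search for
\}\}\r?\n\{
and replace it with a comma followed by a space; the \r?\n allows for either Windows or Unix line endings.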

4. Append the array close character ] directly to the end of the last line.
5. Append a comma to every line that ends with }.
Steps 4 and 5 can be reversed, but from an automation point of view I found this order an easier way to ensure that the last line did not end up with a trailing comma: once ] has been appended, the last record no longer ends with } and is skipped by step 5.
Things to consider:
Having flattened the index, I would also suggest changing the name of _id to index_id; after reading this post back, I'm surprised I missed that the first time.
Overall, the names used for fields in this file are intuitive, but if the file itself is quite large (and if that is an issue for you), then you could save a few bytes by shortening the names of other fields, for example text_entry to text, or speech_number to speechId/speechNo/speechNum.
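To illustrate (these particular short names are just one option), a record with the renames applied might look like:

{"index":"tolstoy","index_id":27,"type":"line","line_id":100,"book":"War and Peace","speechNo":1,"speaker":"Sonya","text":"Weep. It will do you good"}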
When shortening names for data integrations, you should try to avoid reducing columns to ambiguous terms, such as line_id to lid or speech_number to sid, as this imposes more assumptions on the consumer, which means more time you need to spend documenting or answering questions.
When optimising data feeds, you need to find a balance between what is theoretically achievable and what is practical with the resources available to both you and the consumer; each case will be different, for different reasons.
If every record in the file has the same value for a field, then that field could be removed entirely; in the current example, type has the potential to fall into this category.
If bytes across the wire are a concern and you want to stick with JSON, then normalising the data into a nested object graph is an option. You should use some sort of mapping or translation service, or code, to do this (I wouldn't attempt it by hand), but processed into a normalised structure of nested arrays it might look something like this:
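(The grouping names books, speakers, and lines are illustrative choices on my part, and I have hoisted type to the top level on the assumption that it is constant across the file.)

[
  {"index":"tolstoy","type":"line","books":[
    {"book":"War and Peace","speakers":[
      {"speaker":"Sonya","lines":[
        {"_id":27,"line_id":100,"speech_number":1,"text_entry":"Weep. It will do you good"},
        {"_id":28,"line_id":101,"speech_number":1,"text_entry":"I will call Pyotr."}
      ]}
    ]}
  ]}
]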

For a large file (with whitespace removed) that could make a significant difference, whilst it remains easy to target the nodes in the graph via an expression syntax such as JSONPath; importantly, no information has been lost.
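For example, with the illustrative structure above, a JSONPath expression like $..lines[*].text_entry would still return every text entry in the file.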