Here's a weird one. I'm given an ill-conceived input string that is a list of JSON blobs, separated by commas, e.g.:

string input = "{<some JSON object>},{JSON_2},{JSON_3},...,{JSON_n}"

And I have to convert this to an actual list of JSON strings (List<string>).

For context, the unsanitary "input" list of JSONs is read in directly from a .txt file on disk, produced by some other software. I'm writing an "adapter" to allow this data to be consumed by another piece of software that knows how to interpret the individual JSON objects contained within the list. Ideally, the original software could have output one file per JSON object.


The "obvious" solution (using String.Split):

List<string> split = input.Split(',').ToList();

would of course fail, because it would also split on the commas inside the JSON objects ({}) themselves.


I was considering a manual approach - walking the string character-by-character and only splitting out a new element if the count of { is equal to the count of }. Something like:

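List<string> JsonBlobs = new List<string>();
int start = 0, nestingLevel = 0;
for (int i = 0; i < input.Length; i++)
{
    // Track brace depth; note this ignores braces and commas inside string values.
    if (input[i] == '{') nestingLevel++;
    else if (input[i] == '}') nestingLevel--;
    else if (input[i] == ',' && nestingLevel == 0)
    {
        // A comma at depth 0 separates two top-level objects.
        JsonBlobs.Add(input.Substring(start, i - start));
        start = i + 1;
    }
}
// Add the final object, which has no trailing comma.
if (start < input.Length) JsonBlobs.Add(input.Substring(start));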

(The above likely contains bugs)
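For illustration, here is a minimal sketch of a string-aware variant of that loop. It additionally tracks whether the scanner is inside a JSON string literal (including backslash escapes) and counts square brackets, so commas and braces inside values are not treated as separators. The names `JsonBlobSplitter` and `SplitTopLevel` are hypothetical, and the sketch assumes the input is otherwise well-formed; it does not validate the JSON.

```csharp
using System;
using System.Collections.Generic;

public static class JsonBlobSplitter
{
    // Hypothetical helper: splits "{...},{...}" on top-level commas only.
    // Assumes well-formed input; performs no JSON validation.
    public static List<string> SplitTopLevel(string input)
    {
        var blobs = new List<string>();
        int start = 0, depth = 0;
        bool inString = false, escaped = false;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (inString)
            {
                // Inside a string literal only escapes and the closing quote matter.
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
                continue;
            }
            if (c == '"') inString = true;
            else if (c == '{' || c == '[') depth++;
            else if (c == '}' || c == ']') depth--;
            else if (c == ',' && depth == 0)
            {
                // A comma outside every object, array and string separates two blobs.
                blobs.Add(input.Substring(start, i - start).Trim());
                start = i + 1;
            }
        }

        // The final blob has no trailing comma.
        string last = input.Substring(start).Trim();
        if (last.Length > 0) blobs.Add(last);
        return blobs;
    }
}
```

Each returned element is a substring of the original text (trimmed of surrounding whitespace), so the contents of each blob are left exactly as they appeared in the input.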


I had also considered adding JSON array brackets ([]) on either end of the string and letting a JSON serializer deserialize it as a JSON array, then re-serializing each of the array elements one at a time:

List<string> JsonBlobs = Newtonsoft.Json.Linq.JArray.Parse("[" + input + "]")
    .Select<Newtonsoft.Json.Linq.JToken, string>(token => token.ToString()).ToList();

But this seems overly expensive, and could potentially result in newly serialized JSON representations that are not exactly equal to the original string contents.


Any better suggestions?

I'd prefer an easily understandable approach using built-in libraries and/or LINQ if possible. Regex would be a last resort, although nifty regex solutions would also be interesting to see.

Alain
  • So if you deserialize them as an array and then serialize them as a list of strings, and the objects are structurally and value-wise identical even if the strings don't match the input parts exactly, what's the harm? It may not be the most efficient way to do it, but if it functions, do that now and come up with something better later. – madreflection Feb 18 '20 at 19:00
  • Pick duplicate you like from https://www.bing.com/search?q=c%23%20json%20fragment%20SupportMultipleContent%20%20site%3Astackoverflow.com (FM - https://www.newtonsoft.com/json/help/html/ReadMultipleContentWithJsonReader.htm) – Alexei Levenkov Feb 18 '20 at 19:58
  • Deserialization of comma-separated JSON is now supported directly by Json.NET by setting `JsonReader.SupportMultipleContent = true`; see [Additional text encountered after finished reading JSON content:](https://stackoverflow.com/a/50014780/3744182). If you really need each blob *as a string* you can deserialize each one to a `JRaw`, see [Efficiently get full json string in JsonConverter.ReadJson()](https://stackoverflow.com/q/56944160/3744182). – dbc Feb 18 '20 at 20:43

2 Answers

Trying to parse this out using your own rules is fraught. You noticed the problem where JSON properties are comma-separated, but also bear in mind that JSON values can include strings, which could contain braces and commas, and even quote characters that have nothing to do with the JSON structure.

{"John's comment": "I was all like, \"no way!\" :-}"}

To do it right, you're going to need to write a parser capable of handling all the JSON rules. You're likely to make mistakes, and unlikely to get much value out of the effort you put into it.

I would personally suggest the approach of adding brackets on either side of the string and deserializing the whole thing as a JSON array.

I'd also suggest questioning the requirement to convert the result to a list of strings: Was that requirement based on someone's assumption that producing a list of strings would be simpler than producing a list of JObjects or a list of some specific serialized type?
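As a rough sketch of the streamed alternative raised in the comments, Json.NET's `JsonTextReader` can read the comma-separated objects one at a time when `SupportMultipleContent` is set, and `JRaw.Create` can capture each object as JSON text. The names `MultiContentReader` and `ReadBlobs` are hypothetical, and this assumes a Json.NET version recent enough to accept commas between top-level values. Setting `DateParseHandling.None` is a choice made here so that date-like strings stay as plain strings. The output is JSON text re-written by the library, so whitespace and number formatting may still differ from the input.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class MultiContentReader
{
    // Hypothetical helper: reads "{...},{...}" one top-level object at a time.
    public static List<string> ReadBlobs(string input)
    {
        var blobs = new List<string>();
        using (var reader = new JsonTextReader(new StringReader(input)))
        {
            // Allow more than one top-level value in the same stream.
            reader.SupportMultipleContent = true;
            // Keep date-like strings as strings rather than converting them.
            reader.DateParseHandling = DateParseHandling.None;

            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    // JRaw.Create consumes the whole object and leaves the
                    // reader positioned on its closing brace.
                    blobs.Add(JRaw.Create(reader).ToString());
                }
            }
        }
        return blobs;
    }
}
```

Passing a `StreamReader` over the file instead of a `StringReader` would avoid loading the whole input into memory first.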

StriplingWarrior
  • Regarding the string requirement, JSON is a textual representation of an object graph, so if, for example, you have spaces inside your braces on input and they're not included in the output, that's not a substantive difference. If that matters to you, that's the reason to question the requirement. – madreflection Feb 18 '20 at 18:48
  • Great point. As for the nature of the requirement, I will edit it into the post for context. – Alain Feb 18 '20 at 18:51
  • @Alain: That context doesn't give me any reason to believe that you actually need to produce the exact JSON strings that you're given. I'd start by throwing brackets around it and deserializing it as JSON. I think JSON.NET is able to do some streamed deserialization, so you could probably consume the input one object at a time and write each result to a separate file. Only if you find serious performance issues with that approach should you look at writing a more optimized version. – StriplingWarrior Feb 18 '20 at 19:02
  • Agreed, while it would be "nice" to leave the original strings otherwise untouched, it should have no meaningful impact on the final use of the data. My fear (from past experience) was that Newtonsoft would e.g. deserialize UTC DateTimes and re-serialize them as locale-specific DateTimes right off the bat, which can have subtle impacts (like when the UTC DateTime had a value of DateTime.MinValue, and the locale is -3 hours) – Alain Feb 18 '20 at 19:10
  • JSON doesn't have any temporal types, so if you only use LINQ to JSON types defined in the `Newtonsoft.Json.Linq` namespace (`JArray.Parse` does that), no conversion will take place because they won't be parsed as anything but strings. – madreflection Feb 18 '20 at 19:12
  • @Alain: madreflection is right. If you deserialize straight to any kind of JToken (JArray, JObject, JValue), there's no conversion or loss of data from the original string. These types are specifically built to mirror the structure of JSON itself. – StriplingWarrior Feb 18 '20 at 19:29

You can try splitting on:

(?<=}),(?={)

but this of course assumes that a JSON string does not literally contain a sequence of },{ such as:

{"key":"For whatever reason, },{ literally exists in this string"}

It would also fail for a single object that contains an array of objects, such as:

{"key1":[{"key2":"value2"},{"key3":"value3"}]}
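To make the limitation concrete, here is a small sketch using `Regex.Split` with the pattern above (braces escaped for clarity, which should match the same text). `RegexSplitDemo` and `Split` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Splits on a comma that sits directly between '}' and '{'.
    public static string[] Split(string input)
    {
        return Regex.Split(input, @"(?<=\}),(?=\{)");
    }
}
```

For a simple input such as `{"a":1},{"b":2}` this yields the two objects. For the nested-array example, which is a single object, it also yields two pieces, neither of which is valid JSON on its own.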

:-/

MonkeyZeus