
I receive string content that starts with a JSON value (could be simple or complex) and has some additional content afterward. I'd like to be able to parse the JSON document.

I don't have control over the string, so I can't put any kind of delimiter after the JSON content that would enable me to isolate it.

Examples:

"true and some more" - yields <true>
"false this is different" - yields <false>
"5.6/7" - yields <5.6>
"\"a string\"; then this" - yields <"a string">
"[null, true]; and some more" - yields <[null, true]>
"{\"key\": \"value\"}, then the end" - yields <{"key": "value"}>

The issue is the trailing content. The parser expects the input to end and throws an exception:

')' is invalid after a single JSON value. Expected end of data.

There isn't an option in JsonDocumentOptions to allow trailing content.
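For reference, here is a minimal sketch that reproduces the failure with one of the example inputs above. The variable names are only illustrative, and the exact exception message depends on which character follows the value, so the sketch prints it rather than assuming it.

```csharp
using System;
using System.Text.Json;

// One of the example inputs: a valid JSON array followed by trailing text.
string input = "[null, true]; and some more";
bool threw = false;

try
{
    // JsonDocument.Parse expects the whole input to be a single JSON value.
    using var doc = JsonDocument.Parse(input);
}
catch (JsonException ex)
{
    threw = true;
    Console.WriteLine(ex.Message);
}
```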

As a bonus, if you can give a solution that uses ReadOnlySpan<char>, that'd be awesome.

gregsdennis
  • You can do this with [tag:json.net] via the `CheckAdditionalContent` setting, see [Discarding garbage characters after json object with Json.Net](https://stackoverflow.com/q/37172263/3744182). There's nothing built into [tag:system.text.json] to do this though. Maybe `Utf8JsonStreamReader` from [this answer](https://stackoverflow.com/a/55429664/3744182) to [Parsing a JSON file with .NET core 3.0/System.text.Json](https://stackoverflow.com/q/54983533/3744182) might do what you need, possibly with appropriate tweaks... – dbc Sep 24 '20 at 20:18
  • Huh, `Utf8JsonStreamReader` from [this answer](https://stackoverflow.com/a/55429664/3744182) to [Parsing a JSON file with .NET core 3.0/System.text.Json](https://stackoverflow.com/q/54983533/3744182) by [mtosh](https://stackoverflow.com/users/7217527/mtosh) actually works as-is! See https://dotnetfiddle.net/o9Ctba. Didn't really expect that actually. Mark as a duplicate, or add as an answer? – dbc Sep 25 '20 at 18:39
  • You can't deserialize from a `ReadOnlySpan<char>` using `System.Text.Json` though; you can only deserialize from byte spans, sequences or streams. That's because it's designed to deserialize directly from UTF-8 byte sequences, not char sequences. – dbc Sep 25 '20 at 18:42
  • I explicitly stated that I'm using System.Text.Json. Suggesting another library isn't a viable solution. – gregsdennis Sep 25 '20 at 21:29
  • I'll give the custom reader a try, but I'd prefer that I don't have to convert _back_ into a string having already converted the source string into the span for processing. – gregsdennis Sep 25 '20 at 21:34
  • *I'd prefer that I don't have to convert back into a string* -- No choice there. `JsonSerializer` and `Utf8JsonReader` work only with byte sequences or spans, so either you do it or the library does it internally. When you call [`JsonSerializer.Deserialize()`](https://github.com/dotnet/runtime/blob/master/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/JsonSerializer.Read.String.cs#L86) internally the library is encoding to utf8, then deserializing from that. – dbc Sep 25 '20 at 23:14
  • Hmm. That answer is specifically for deserializing into an object. I need parsing into a JsonElement. – gregsdennis Sep 26 '20 at 22:29
  • That's shown in the [fiddle](https://dotnetfiddle.net/o9Ctba). Just deserialize to a `JsonElement`: `jsonStreamReader.Deserialize<JsonElement>()`. – dbc Sep 26 '20 at 23:19
  • Okay. So I see the tests, but for some reason, my specific test case isn't working. I have a number followed by a close parenthesis `1)`. I see that `5.6/7` works just fine, but I still get the error for the `)`. – gregsdennis Sep 27 '20 at 00:04
  • It's something inside `JsonReader.Read()` that doesn't like the `)`. – gregsdennis Sep 27 '20 at 00:13
  • Can you [edit] your question to provide a [mcve]? – dbc Sep 27 '20 at 00:22
  • I figured out why your test cases work: they all use valid JSON characters (`/` is valid for comments). If you add a test case that contains a character that's invalid for JSON, e.g. `+`, `_`, `)`, or `g`, the test fails. (Not sure why `;` works.)... maybe. `:` also breaks it. – gregsdennis Sep 27 '20 at 00:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/222136/discussion-between-gregsdennis-and-dbc). – gregsdennis Sep 27 '20 at 00:35

1 Answer


The custom reader suggested in the comments didn't work for me because the problem is in the base reader: it rejects certain trailing characters.

Since I still wanted to rely on JsonDocument.Parse() to extract the element for me, I really just needed to find where the element stops, break that bit off as a separate piece, and submit that piece to the parse method. Here's what I came up with:

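// Extension method: declare it inside a static class.
// Requires: using System; using System.Text.Json;
public static bool TryParseJsonElement(this ReadOnlySpan<char> span, ref int i, out JsonElement element)
{
    try
    {
        int end = i;
        char endChar;
        switch (span[i])
        {
            case 'f': // false
                end += 5;
                break;
            case 't': // true
            case 'n': // null
                end += 4;
                break;
            case '.': case '-': case '0':
            case '1': case '2': case '3':
            case '4': case '5': case '6':
            case '7': case '8': case '9':
                // A sign is allowed at the very start and right after an exponent marker.
                var allowSign = true;
                while (end < span.Length)
                {
                    var c = span[end];
                    var isSign = c == '-' || c == '+';
                    var isExponent = c == 'e' || c == 'E';
                    if (isSign)
                    {
                        if (!allowSign) break;
                    }
                    else if (!isExponent && c != '.' && !(c >= '0' && c <= '9')) break;

                    allowSign = isExponent;
                    end++;
                }
                break;
            case '\'':
            case '"':
                end = i + 1;
                endChar = span[i];
                while (end < span.Length && span[end] != endChar)
                {
                    // Skip the character after a backslash so an escaped quote doesn't end the string.
                    if (span[end] == '\\')
                    {
                        end++;
                        if (end >= span.Length) break;
                    }
                    end++;
                }

                end++; // include the closing quote
                break;
            case '{':
            case '[':
                var openChar = span[i];
                endChar = openChar == '{' ? '}' : ']';
                var depth = 1; // tracks nested containers of the same kind
                var inString = false;
                end = i + 1;
                while (end < span.Length)
                {
                    var c = span[end];
                    if (inString)
                    {
                        if (c == '\\') end++; // skip the escaped character
                        else if (c == '"') inString = false;
                    }
                    else if (c == '"') inString = true;
                    else if (c == openChar) depth++;
                    else if (c == endChar && --depth == 0) break;

                    end++;
                }

                end++; // include the closing bracket
                break;
            default:
                element = default;
                return false;
        }

        var block = span[i..end];
        // Normalize single-quoted strings to the double-quoted form that JSON requires.
        if (block[0] == '\'' && block[^1] == '\'')
            block = $"\"{block[1..^1].ToString()}\"".AsSpan();
        using var document = JsonDocument.Parse(block.ToString());
        // Clone so the element stays valid after the document is disposed.
        element = document.RootElement.Clone();
        i = end;
        return true;
    }
    catch
    {
        // Out-of-range slices and invalid JSON both end up here.
        element = default;
        return false;
    }
}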

It doesn't care much about what's in the middle of the value. For strings, objects, and arrays, it only tracks whether it's currently inside a string (where the end character can legitimately appear) and skips backslash-escaped characters. It works well enough for my purposes.

It takes a ReadOnlySpan<char> and an integer by reference. The index i must point to the start of the expected JSON value; if a valid value is found, i is advanced to the character just past it. It also follows the standard Try* pattern of returning a bool with an output parameter for the value.
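If the intermediate string created by block.ToString() is a concern, one possible variation is to encode the isolated slice straight to UTF-8 and hand the bytes to JsonDocument.Parse. The following is only a sketch of that idea: ParseSlice is a hypothetical helper name, not part of the method above, and the hard-coded slice length stands in for the boundary that the scanning logic would normally find.

```csharp
using System;
using System.Text;
using System.Text.Json;

// The first 12 characters are the JSON value "[null, true]";
// the fixed length is a stand-in for real boundary detection.
ReadOnlySpan<char> input = "[null, true]; and some more".AsSpan();
JsonElement element = ParseSlice(input[..12]);
Console.WriteLine(element.GetArrayLength()); // prints 2

// Hypothetical helper: parses an already-isolated JSON slice without building a string.
static JsonElement ParseSlice(ReadOnlySpan<char> block)
{
    // Encode the chars directly into a UTF-8 buffer.
    var utf8 = new byte[Encoding.UTF8.GetByteCount(block)];
    Encoding.UTF8.GetBytes(block, utf8);

    // Parse from the byte buffer; Clone lets the element outlive the document.
    using var doc = JsonDocument.Parse(new ReadOnlyMemory<byte>(utf8));
    return doc.RootElement.Clone();
}
```

This still allocates a byte array per value, so whether it beats the string round trip depends on the workload.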

gregsdennis