
I am working with an API that accepts an HTTP POST request with URL-form-encoded query data as the payload. The response from the server is chunked, and each chunk contains a single JSON object. I'm trying to read and parse the server's response chunk by chunk from C#, so that I can take each chunk, deserialize it with Newtonsoft, and do whatever processing I need at that point. The server returns an unknown number of records per query - it could be zero records, or thousands of chunks.
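
For reference, a minimal sketch of the request side (the endpoint URL and form field names are placeholders):

```csharp
using System.Collections.Generic;
using System.Net.Http;

var client = new HttpClient();

// Endpoint and form fields below are placeholders for illustration.
var request = new HttpRequestMessage(HttpMethod.Post, "https://example.com/api/query")
{
    Content = new FormUrlEncodedContent(new Dictionary<string, string>
    {
        ["query"] = "..."
    })
};

// ResponseHeadersRead makes SendAsync return as soon as the headers arrive,
// so the body can be consumed as a stream instead of being buffered in full.
using var response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
using var stream = await response.Content.ReadAsStreamAsync();
```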

My research and testing with the typical solutions like HttpClient indicate that these libraries "handle" chunking by simply concatenating everything into a single response stream. Furthermore, I've read other posts indicating that if a server doesn't follow the spec exactly, it's even possible to get an exception at the end of the stream.

I've considered the following solutions, but none seem optimal:

  1. Read the stream from the HTTP response char-by-char, counting { and } characters to find the start and end of a JSON object. Every time a closing } is found, parse the object. This is incredibly ugly, inefficient, and not generic - it assumes every JSON response is an object and would need to be altered if, for example, the server sent a JSON array ([ and ]) instead, or even just a single JSON string per chunk. (See the sketch after this list.)

  2. Skip HttpRequest/HttpClient entirely and do everything with raw sockets. Then I can parse the chunk sizes, read exactly that many bytes from the socket stream, and parse accordingly. This would work, except it feels like a lot of "reinventing the wheel", since I'd have to implement URL encoding for the POST body, header parsing, SSL/TLS, and so on. All of this has basically been "solved" by HttpClient, so implementing it myself feels like a bad idea, if for no other reason than that I could easily introduce a parsing bug.

  3. Since the server sends one JSON object per chunk, read the entire response, then treat each }{ as the split point between JSON objects (in actual JSON there would be a , between two objects that were part of a list). This feels unreliable at best - it assumes there is no whitespace on either side of each chunk's JSON object. It's also inefficient: if the server returned millions of records, the entire response would have to be held in RAM, and a response with millions of records could total over 1GB across hundreds of chunks. While that's not necessarily a problem for a machine with plenty of RAM, it's an unnecessarily wasteful way to parse data that is streamable by design.
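
For illustration, here's roughly what option 1 would look like - a naive sketch that assumes, incorrectly, that { and } never appear inside string literals, which is one of the reasons it's so fragile:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using Newtonsoft.Json.Linq;

// Naive sketch of option 1: count braces to find object boundaries.
// Breaks if '{' or '}' appears inside a string literal, and assumes
// every top-level value is an object.
static IEnumerable<JObject> ReadObjectsByBraceCounting(TextReader reader)
{
    var buffer = new StringBuilder();
    int depth = 0;
    int ch;
    while ((ch = reader.Read()) != -1)
    {
        char c = (char)ch;
        if (c == '{') depth++;
        if (depth > 0) buffer.Append(c); // characters between objects are dropped
        if (c == '}' && --depth == 0)
        {
            yield return JObject.Parse(buffer.ToString());
            buffer.Clear();
        }
    }
}
```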

The ideal scenario is some sort of enumerator that reads the HTTP stream chunk by chunk, since the API produces chunks where each chunk represents exactly one JSON object. This is what I considered implementing in option 2, but again, that seems like a lot of reinventing the wheel, with the potential for serious bugs. The second-best option would be a way to get the raw, underlying socket stream from HttpClient after it has performed the request and parsed the headers - in other words, a way to get the stream that still includes the chunk sizes and separators, so that I can parse that stream directly, extract the chunk sizes, and basically do option 2 above without having to write my own HTTP implementation.

What is the best option for me to implement this functionality?

fdmillion
  • Well, I would suggest reading the entire response into a file, then using [this](https://stackoverflow.com/questions/43747477/how-to-parse-huge-json-file-as-stream-in-json-net) method for parsing it. Another [link](https://www.robbiecode.com/parsing-big-json-files-with-streamreader-in-c/) that seems like a more complete implementation. – Eldar Dec 01 '19 at 20:25
  • This is not ideal since it still requires writing the entire response to a disk file first. It's easy to say "but storage is cheap", but I might run this parsing tool in an environment where storage is limited, slow, or costs extra money. – fdmillion Dec 01 '19 at 20:31
  • Well, then read the response as a stream and use the approach mentioned in the links. – Eldar Dec 01 '19 at 20:34
  • The second link basically looks like option 1 in my post - it's reading the stream char by char and looking for object start/end indicators. Is this the best option? It seems inefficient to read a very large (>1GB or even >4GB) stream char-by-char. – fdmillion Dec 01 '19 at 20:38
  • No, it's not char by char - `JsonTextReader` reads token by token. Its documentation says "Represents a reader that provides fast, non-cached, forward-only access to JSON text data". – Eldar Dec 01 '19 at 20:43
  • So will that handle the fact that the JSON objects won't be separated with a `,`? i.e. it won't be a list of JSON objects, but just raw JSON objects separated by nothing. – fdmillion Dec 01 '19 at 20:46
  • It will expect the input to be valid JSON, whether array or object - but **valid**. – Eldar Dec 01 '19 at 20:51
  • Late comment, but if you can get the entire sequence of JSON objects as a `Stream` then you can parse it with Json.NET by setting [`JsonReader.SupportMultipleContent = true`](https://www.newtonsoft.com/json/help/html/P_Newtonsoft_Json_JsonReader_SupportMultipleContent.htm). See [What is the correct way to use JSON.NET to parse stream of JSON objects?](https://stackoverflow.com/q/57727883/3744182). – dbc Apr 14 '20 at 16:31
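
Building on dbc's suggestion, here's a minimal sketch of how `SupportMultipleContent` would combine with the streamed response from the request sketch above (assuming, as in my case, that the decoded chunks concatenate into a valid sequence of root-level JSON objects):

```csharp
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// 'stream' is the streamed response body from the request sketch above.
using var streamReader = new StreamReader(stream);
using var jsonReader = new JsonTextReader(streamReader)
{
    // Accept a sequence of root-level JSON values with nothing between them.
    SupportMultipleContent = true
};

var serializer = new JsonSerializer();
while (await jsonReader.ReadAsync())
{
    // The reader now sits at the start of the next root-level value;
    // deserialize just that value without buffering the rest of the stream.
    var record = serializer.Deserialize<JObject>(jsonReader);
    // ... process 'record' here ...
}
```

Since HttpClient transparently decodes the chunked transfer encoding, the decoded stream is exactly that back-to-back sequence of objects, so the chunk boundaries never need to be recovered.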

0 Answers