I'm building a REST-based web service that will serve a couple hundred clients, which will upload/request small bursts of information throughout the day and make one larger cache update (about 100-200 KB) once a day.
While testing the large update on the production machine (a Linux virtual machine in the cloud running Apache/PHP) I discovered to my utter dismay that the data reaches the client corrupted (i.e. with one or more wrong characters) literally MOST of the time.
Example of corrupted JSON (the parser says "SyntaxError: JSON.parse: expected ':' after property name in object at line 1 column 81998 of the JSON data"):
"nascita":"1940-12-17","attiva":true,","cognome":"MILANI"
should be
"nascita":"1940-12-17","attiva":"true","cognome":"MILANI"
These are the HTTP headers of the response:
Connection: Keep-Alive
Content-Type: application/json
Date: Fri, 02 Jun 2017 16:59:39 GMT
Keep-Alive: timeout=5, max=100
Server: Apache/2.4.18 (Ubuntu)
Transfer-Encoding: chunked
I am certainly not an expert when it comes to networking, but I used to think that such occurrences, i.e. failures of both IP and TCP error detection, were extremely rare. (I found this post interesting: Can a TCP checksum produce a false positive? If yes, how is this dealt with?)
So... what's going on here? Am I missing something?
I started to think of possible solutions.
The quickest I could think of was using HTTP compression: if the client is unable to decompress the content (which is very likely in case of data corruption) then I can ask for the content again. I enabled that on Apache and, to my surprise, all responses completed with valid data. Could it be that web browsers (I'm using good old Firefox for testing the web service) have some built-in mechanism for re-requesting corrupt compressed data? Or MAYBE the smaller, less regular nature of compressed data makes TCP/IP mistakes less likely??
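To make the "decompress or ask again" idea concrete, here is a minimal sketch, assuming a Python client that receives the raw gzip-compressed body as bytes. The names CorruptBody, decode_gzip_json and fetch_with_retry are hypothetical, and fetch stands in for whatever actually performs the HTTP request. It relies on the gzip trailer, which stores a CRC-32 and the length of the uncompressed data.

```python
import gzip
import json
import zlib
from typing import Any, Callable


class CorruptBody(Exception):
    """Raised when a response body fails decompression or JSON parsing."""


def decode_gzip_json(body: bytes) -> Any:
    """Decompress a gzip body and parse it as JSON, or raise CorruptBody."""
    try:
        # gzip.decompress checks the CRC-32 and size stored in the gzip trailer.
        text = gzip.decompress(body).decode("utf-8")
        return json.loads(text)
    except (OSError, EOFError, zlib.error, ValueError) as exc:
        raise CorruptBody(str(exc)) from exc


def fetch_with_retry(fetch: Callable[[], bytes], attempts: int = 3) -> Any:
    """Call the (hypothetical) fetch() until one body decodes cleanly."""
    for _ in range(attempts):
        try:
            return decode_gzip_json(fetch())
        except CorruptBody:
            continue  # ask for the content again
    raise RuntimeError(f"no valid response after {attempts} attempts")
```

Many HTTP client libraries decompress gzip responses transparently, so in practice the check may happen inside the library rather than in my own code.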
The other quick solution that came to my mind was to calculate a checksum of the content, something I could do for smaller requests that don't really benefit from compression. I am trying to figure out if and how the Content-MD5 field in HTTP could help me... Web browsers seem to ignore it, so I guess I will have to compute and compare it explicitly on my client...
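This is roughly what I imagine the explicit check would look like, as a sketch only and again assuming a Python client. As far as I understand, Content-MD5 carries the base64-encoded MD5 digest of the body (RFC 1864). The helper names content_md5 and body_matches are hypothetical.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the base64-encoded MD5 digest of the body (Content-MD5 style)."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def body_matches(body: bytes, header_value: str) -> bool:
    """Compare the received body against the checksum sent by the server."""
    return content_md5(body) == header_value.strip()
```

The server would have to compute the same value over the exact bytes it sends, and the client would re-request the content whenever body_matches returns False. MD5 is used here only to detect accidental corruption, not for security.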
Using TLS may be another good idea, possibly the best.
Or again.... am I missing something HUGE? Like, I don't know, for some reason my Apache is using UDP??