Storing and sending raw file data within a JSON object

Question

I'm looking for a way to transfer the raw file data of any file type with any possible content (files and file content are all user-generated) in both directions, using xhr/ajax calls in a Backbone front end against a Django back end.

EDIT: Maybe the question is still unclear...

If you open a file in an IDE (such as Sublime), you can view and edit the actual code that comprises that file. I'm trying to put THAT raw content into a JSON so I can send to the browser, it can be modified, and then sent back.

I posted this question because I was under the impression that, since the contents of these files can effectively be in ANY coding language, just stringifying the contents and sending them would be a brittle solution that is easy to break or exploit. Content could contain any number of ', ", { and } chars that would seem to break JSON formatting, and escaping those characters would leave artifacts within the code that would effectively break it (wouldn't it?).

If that assumption is wrong, THAT would also be an acceptable answer (so long as you could point out whatever it is I'm overlooking).

The project I'm working on is a browser-based IDE that will receive a complete file-structure from the server. Users can add/remove files, edit the content of those files, then save their changes back to the server. The sending/receiving all has to be handled via ajax/xhr calls.

Within Backbone, each "file" is instantiated as a model and stored in a Collection. The contents of the file would be stored as an attribute on the model.
Ideally, file content would still reliably throw all the appropriate events when changes are made.
Fetching contents should not be broken out into a separate call from the rest of the file model. I'd like to just use a single save/fetch call for sending/receiving files including the raw content.

Solutions that require Underscore/jQuery are fine, and I am able to bring in additional libraries if there is something available that specializes in managing that raw file data.

What is the question? You can easily store your files as `{data: 'stringyfied_data'}` model — Lesha Ogonkov, Sep 08 '15 at 20:25
@LeshaOgonkov - My assumption (which might be wrong?) is that there would inevitably be file types/formats or content within these files that could break or exploit the interface if I stringified everything. Is that wrong then? Additionally, how reliable would it be to stringify everything across potentially any code language, without modifying or corrupting the contents of those files. — relic, Sep 08 '15 at 20:28
You can parse and escape model data, it's up to your implementation, Backbone just helps you to store your ideas in JS — Lesha Ogonkov, Sep 08 '15 at 20:30
My question is more about the sending/receiving/storing of the raw data. The context just happens to be within Backbone, but I'll reword the question title to make that a little more clear. — relic, Sep 08 '15 at 20:34
It's a huge topic. You can update raw data by calculating changes and sending them to the server; the data you receive could be an AST, or some special format for keeping markup. There are a lot of options. But I think Python is not the best option for managing a high-load project like that. Maybe for a 1-2 person prototype only. — Lesha Ogonkov, Sep 08 '15 at 20:50
@LeshaOgonkov - A solution for generating and storing files on the server is already in place. I just need to find a method of sending them over and getting them back, and if/what conversion needs to take place at that time so that the front end can properly manage them. You mentioned AST? I could start by looking into that. Is there any additional terminology that might help when Googling around for options? — relic, Sep 08 '15 at 21:22
@relic180 If I interpret the question correctly, no "conversion" would be needed to achieve the expected results, as the content-type of the generated, edited file would be `"text/plain"`? — guest271314, Sep 13 '15 at 14:15
@guest271314 - Not sure I understand how `text/plain` would help here. The files themselves are not going to be sent in either direction on their own, but will always live inside of a JSON, because if the user's filetree contained dozens (or potentially hundreds) of files, I shouldn't be firing an explicit call for every single file, right? The "conversion" would be whatever happens to ensure file content plays nice when it gets stuck inside of that JSON. — relic, Sep 13 '15 at 20:57
@relic Is requirement to open local file, e.g., "file.js" in browser, modify file contents, save modified file as "file.js" ? — guest271314, Sep 13 '15 at 23:25
@guest271314 - Yes. Although the user won't see any of the uploads/downloads happening in the background. The goal is to mimic exactly what it feels like to open and edit a file in an IDE, except this will be running in a browser using a front-end framework (so, no page refreshing). — relic, Sep 14 '15 at 03:14
@relic _"Yes. Although the user won't see any of the uploads/downloads happening in the background."_ Not certain if this is possible. The user would have to select a file to upload to the browser, then choose to overwrite the existing file with the same name. — guest271314, Sep 14 '15 at 14:00
@guest271314 - All of it is possible, if the spec was just to support text files (that are easily escaped). So the root of my problem is that the contents of these files are not likely to behave themselves within the context of a JSON object. If I can nail down a reliable way to store that content inside JSON, I don't have any issues nailing down the rest of the process. — relic, Sep 14 '15 at 23:01
@relic _"the contents of these files are not likely to behave themselves within the context of a JSON object. If I can nail down a reliable way to store that content inside JSON"_ Tried suggestion of `base64` string at http://stackoverflow.com/a/32530715/ ? See also http://stackoverflow.com/questions/28207106/pdf-file-upload-ajax-html/ — guest271314, Sep 15 '15 at 05:29
@relic Does your solution have to be implemented in Django/Python? I guess if speed is not your concern then it's OK. — jv-k, Sep 16 '15 at 20:31
@JohnValai - The primary server will definitely be Django(py), however it wouldn't be out of the question to setup some intermediary server-side instance or process of some sort, if there was a good reason to do so. — relic, Sep 16 '15 at 22:59

score 10 · Accepted Answer · edited May 23 '17 at 12:25

Interesting question. The code required to implement this would be quite involved, sorry that I'm not providing examples, but you seem like a decent programmer and should be able to implement what's mentioned below.

Regarding the sending of raw data through JSON, all you would need to do to make it JSON-safe and not break your code is to escape the special characters by stringifying with Python's json.dumps and JavaScript's JSON.stringify. [1]
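As a minimal sketch of the escaping step on the Python side (the file name and sample content are made up for illustration), json.dumps handles quotes, braces, backslashes and newlines, and json.loads reverses the escaping exactly:

```python
import json

# Hypothetical file content mixing quotes, braces, backslashes,
# tabs, newlines and non-ASCII characters.
source = 'if (a["k"] == \'x\') { print("C:\\\\tmp\\n"); }\n\t// café ☕\n'

# Sending side: json.dumps escapes whatever JSON requires.
wire = json.dumps({"name": "demo.js", "content": source})

# Receiving side: parsing undoes the escaping, nothing is left behind.
restored = json.loads(wire)["content"]
roundtrip_ok = restored == source
print(roundtrip_ok)
```

On the browser side, JSON.parse and JSON.stringify play the same roles, so the content attribute on the model stays an ordinary string.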

If you are concerned about some form of basic tamper-proofing, then light encoding of your data will fit the purpose, in addition to having the client and server pass a per-session token back and forth with JSON transfers to ensure that the JSON isn't forged from a malicious address.

If you want to check the end-to-end integrity of the data, then generate an md5 checksum and send it inside your JSON and then generate another md5 on arrival and compare with the one inside your JSON.
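A rough sketch of that checksum idea using Python's hashlib (the field names "content" and "md5" are assumptions, not part of any existing API):

```python
import hashlib
import json


def checksum(text: str) -> str:
    """Return the hex MD5 digest of the UTF-8 bytes of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Sender: put the content and its digest side by side in the JSON.
content = 'print("hello")\n'
message = json.dumps({"content": content, "md5": checksum(content)})

# Receiver: recompute the digest and compare with the one that was sent.
received = json.loads(message)
intact = checksum(received["content"]) == received["md5"]

# Simulated corruption: one extra character changes the digest.
received["content"] += " "
corruption_detected = checksum(received["content"]) != received["md5"]
print(intact, corruption_detected)
```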

Base64 encoding: The size of your data would grow by 33% as it encodes four characters to represent three bytes of data.
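The size arithmetic can be checked directly with the standard library; the sample bytes below are arbitrary:

```python
import base64

# 3072 arbitrary bytes (a multiple of 3, so no padding is involved).
raw = bytes(range(256)) * 12

encoded = base64.b64encode(raw)

# Every 3 input bytes become 4 output characters: 3072 / 3 * 4 = 4096.
growth = len(encoded) / len(raw) - 1
print(len(raw), len(encoded), round(growth * 100, 1))
```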

Base85: Encodes four bytes as five characters and will grow your data by 25%, but uses much more processing overhead than Base64 in Python. That's an 8-percentage-point improvement in data size, at the expense of processing overhead. Also, it's not string-safe: the common Ascii85 alphabet includes the double quotation mark and the backslash, which cannot be used unescaped inside a JSON string. Needs to be stringified before JSON transport. [2]
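A small sketch with Python's base64.a85encode (the Ascii85 variant; the sample bytes are arbitrary) shows both the 4-to-5 ratio and why the output still has to go through a JSON encoder:

```python
import base64
import json

# 1024 arbitrary bytes with no all-zero 4-byte group, so Ascii85's
# "z" shortcut for zero groups does not change the length.
raw = bytes(range(256)) * 4

encoded = base64.a85encode(raw).decode("ascii")
growth = len(encoded) / len(raw) - 1  # 5 characters per 4 bytes

# Ascii85 uses the 85 characters from "!" (33) to "u" (117), which
# include the double quote and the backslash.
ascii85_alphabet = "".join(chr(c) for c in range(33, 118))
needs_json_escaping = '"' in ascii85_alphabet and "\\" in ascii85_alphabet

# json.dumps adds whatever escaping the encoded text needs.
wire = json.dumps({"content": encoded})
restored = base64.a85decode(json.loads(wire)["content"])
print(len(encoded), needs_json_escaping, restored == raw)
```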

yEnc has as little as 2-3% overhead (depending on the frequency of identical bytes in the data), but is ruled out by impractical flaws (see [3]).

ZeroMQ Base-85, aka Z85. It's a string-safe variant of Base85, with a data overhead of 25%, which is better than Base64. No stringifying necessary for sticking it into JSON. I highly recommend this encoding algorithm. [4] [5] [6]
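For illustration, here is a hand-rolled sketch of Z85 in Python, written from the alphabet and the "HelloWorld" test vector in the ZeroMQ spec [4]. The function names are made up, and this is not the referenced gist. Z85 only accepts input whose length is a multiple of 4 bytes, so real file data would need a padding scheme, which is left out here.

```python
import json

# The 85-character Z85 alphabet; it contains no double quote or backslash.
Z85_ALPHABET = (
    "0123456789abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"
)
Z85_INDEX = {ch: i for i, ch in enumerate(Z85_ALPHABET)}


def z85_encode(data: bytes) -> str:
    """Encode bytes (length divisible by 4) as Z85 text."""
    if len(data) % 4:
        raise ValueError("Z85 input length must be a multiple of 4 bytes")
    out = []
    for i in range(0, len(data), 4):
        value = int.from_bytes(data[i:i + 4], "big")
        digits = []
        for _ in range(5):
            value, digit = divmod(value, 85)
            digits.append(Z85_ALPHABET[digit])
        out.append("".join(reversed(digits)))  # most significant first
    return "".join(out)


def z85_decode(text: str) -> bytes:
    """Decode Z85 text (length divisible by 5) back to bytes."""
    if len(text) % 5:
        raise ValueError("Z85 text length must be a multiple of 5 characters")
    out = bytearray()
    for i in range(0, len(text), 5):
        value = 0
        for ch in text[i:i + 5]:
            value = value * 85 + Z85_INDEX[ch]
        out += value.to_bytes(4, "big")
    return bytes(out)


sample = bytes([0x86, 0x4F, 0xD2, 0x6F, 0xB5, 0x59, 0xF7, 0x5B])
encoded = z85_encode(sample)
print(encoded, json.dumps(encoded))
```

Because none of the alphabet's characters need escaping, json.dumps only wraps the encoded text in quotes.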

If you're sending only small files (say a few KB), then the overhead of binary-to-text conversion will be acceptable. With files as large as a few MB, it might not be acceptable to have them grow by 25-33%. In this case you can try to compress them before sending. [7]
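A sketch of compressing before encoding, using zlib and Base64 from the Python standard library rather than the JavaScript libraries referenced; the "encoding" field is a made-up convention so the receiver knows how to unpack:

```python
import base64
import json
import zlib

# Hypothetical, fairly repetitive source file (typical of code).
content = ("def handler(request):\n    return render(request)\n" * 200).encode("utf-8")

# Compress first, then make the compressed bytes JSON-safe.
packed = base64.b64encode(zlib.compress(content, 9)).decode("ascii")
message = json.dumps({"encoding": "zlib+base64", "content": packed})

# Receiver: reverse the two steps in the opposite order.
received = json.loads(message)
restored = zlib.decompress(base64.b64decode(received["content"]))
print(len(content), len(packed), restored == content)
```

The browser would need a matching inflate implementation, such as the libraries in [7] and [8].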

You can also send data to the server using multipart/form-data, but I can't see how this will work bi-directionally.

UPDATE

In conclusion, here's my solution's algorithm:

Sending data

Generate a session token and store it for the associated user upon login (server), or retrieve from the session cookie (client)
Generate MD5 hash for the data for integrity checking during transport.
Encode the raw data with Z85 to add some basic tamper-proofing and JSON-friendliness.
Place the above inside a JSON and send POST when requested.

Reception

Grab JSON from POST
Retrieve session token from storage for the associated user (server), or retrieve from the session cookie (client).
Generate MD5 hash for the received data and test against MD5 in received JSON, reject or accept conditionally.
Z85-decode the data in received JSON to get raw data and store in file or DB (server) or process/display in GUI/IDE (client) as required.
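The two lists above can be sketched end to end in Python. This is only an illustration under stated assumptions: Base64 from the standard library stands in for Z85, the hash is taken over the raw bytes before encoding, and the names build_message, open_message and the JSON field names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import secrets


def build_message(raw: bytes, token: str) -> str:
    """Sending side: token + MD5 of the raw bytes + encoded data."""
    return json.dumps({
        "token": token,
        "md5": hashlib.md5(raw).hexdigest(),
        "data": base64.b64encode(raw).decode("ascii"),  # stand-in for Z85
    })


def open_message(message: str, expected_token: str) -> bytes:
    """Receiving side: check the token, decode, then verify the hash."""
    payload = json.loads(message)
    if not hmac.compare_digest(payload["token"], expected_token):
        raise PermissionError("session token mismatch")
    raw = base64.b64decode(payload["data"])
    if hashlib.md5(raw).hexdigest() != payload["md5"]:
        raise ValueError("checksum mismatch")
    return raw


# Generated at login and stored with the user's session (assumption).
session_token = secrets.token_urlsafe(32)

original = b'{"not": "parsed"}\n\x00\xff'
message = build_message(original, session_token)
received = open_message(message, session_token)
print(received == original)
```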

References

[1] How to escape special characters in building a JSON string?

[2] Binary Data in JSON String. Something better than Base64

[3] https://en.wikipedia.org/wiki/YEnc

[4] http://rfc.zeromq.org/spec:32

[5] Z85 implementation in C/C++ https://github.com/artemkin/z85

[6] Z85 implementation in Python https://gist.github.com/minrk/6357188

[7] JavaScript zip library http://stuk.github.io/jszip/

[8] JavaScript implementation of Gzip (Stack Overflow question)

Fantastic reply. Really appreciate all the information. Although the exact implementation I'm gonna end up with has been complicated by the additional requirement of a terminal... so it looks as though I may end up with a solution that involves websockets rather than pure AJAX. I was already leaning toward a base64 solution after Nelson's post, but I had no idea about those other encoding methods you listed and hadn't even thought about integrating a hash test until you mentioned it. Thanks for that. — relic, Sep 17 '15 at 19:27
You're welcome. If you do it differently, please share your solution :) — jv-k, Sep 17 '15 at 19:33
Good answer, just please don't use MD5 for anything. If it is just for integrity checks against random errors, there are more efficient checksum algorithms, which may also support error correction. If collision resistance is needed MD5 is broken and SHA256 or something of similar strength is required. MD5 is basically never the answer. — Leonidaz0r, Apr 28 '17 at 11:51
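As a sketch of the two alternatives that last comment points at, Python's standard library offers zlib.crc32 as a cheap check against accidental corruption and hashlib.sha256 where collision resistance matters. The helper names are made up, and neither of these provides error correction:

```python
import hashlib
import zlib


def crc32_hex(data: bytes) -> str:
    """Cheap 32-bit checksum; detects random errors, not tampering."""
    return format(zlib.crc32(data), "08x")


def sha256_hex(data: bytes) -> str:
    """Cryptographic digest for cases that need collision resistance."""
    return hashlib.sha256(data).hexdigest()


data = b'console.log("hello");\n'
print(crc32_hex(data), sha256_hex(data))
```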

Nelson Teixeira · Answer 2 · 2015-09-16T04:25:12.300

As far as I'm concerned, a simple Base64 conversion will do it. Stringify, convert to Base64, then pass it to the server and decode it there. Then you won't have the raw file transfer and you will still keep your code simple.

I know this solution could seem a bit too simple, but think about it: many cryptographic algorithms can be broken given the right hardware. One of the most secure means would be through a digital certificate and then encrypt data with the private key and then send it over to the server. But to reach this level of security, every user of your application would have to have a digital certificate, which I think would be an excessive demand on your users.

So ask yourself: if implementing a really safe solution adds a lot of hassle for your users, why do you need a safe transfer at all? Based on that, I reaffirm what I said before: a simple Base64 conversion will do. You can also use some other algorithm like SHA256 or something to make it a little bit safer.
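A minimal sketch of the Base64 approach from this answer, using bytes that are not valid UTF-8 text (the sample bytes and field names are made up). Content like this cannot be placed in a JSON string directly, so it needs a binary-to-text step:

```python
import base64
import json

# Hypothetical binary content: the PNG signature plus two arbitrary bytes.
binary = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00, 0xFF])

# These bytes are not valid UTF-8, so they cannot be sent as plain text.
try:
    binary.decode("utf-8")
    is_text = True
except UnicodeDecodeError:
    is_text = False

# Base64 turns them into ASCII that fits in a JSON string unchanged.
wire = json.dumps({"name": "logo.png", "content": base64.b64encode(binary).decode("ascii")})

# Receiving side: parse the JSON, then decode the Base64 field.
restored = base64.b64decode(json.loads(wire)["content"])
print(is_text, restored == binary)
```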

'Safety' is definitely a lower concern than reliability and speed. There shouldn't be any personal information contained in these files, and there are already containment measures in place to limit the impact of malicious code. — relic, Sep 11 '15 at 19:28

score 4 · Answer 3 · edited May 23 '17 at 10:27

If the only concern here is that the raw content of your code files (the "data" your model is storing) will cause some type of issue when stored in JSON, this is easily avoided by escaping your data.

Stringifying your raw code file contents can cause issues, as anything resembling JavaScript or JSON will be parsed into an actual JSON object. Your code file data can and should be stored simply as an escaped string. Your fear here is that said string may contain characters that could break when stored inside a JavaScript string; this is alleviated by escaping the entire string, thus double, triple, quadruple, etc. escaping anything already escaped in the code file.

In essence it is important to remember here that raw code in a file is nothing but a glorified string when stored in a database, unless you are adding in-line metadata dynamically. It's just text, and doing standard escaping will make it safe to store in whatever format as a string (inside "" or '') in JSON.

I recommend reading this SO answer, as I also referenced it to verify what I already thought was correct: How To Escape a JSON string containing newline characters using JavaScript

answered Sep 16 '15 at 01:26

Tyler Durden


I'm not sure I'm convinced that it's as simple as escaping, even running multiple escapes against it. How would I automate a process that determines how many times, and against which characters to test and escape (since the actual language the files are written in is effectively unknown)? Then, how would I communicate that process to the front end (after the transfer has been made) to properly parse the files and guaranteeing that no stray characters still remain in the content? I suppose a "decoding" key could be attached to each data model... – relic Sep 17 '15 at 19:16
You only need to escape it once. – Tyler Durden Sep 17 '15 at 20:17
I have not yet, but I definitely will give it a go and see how it works. If a simple string escape actually does work, that'll certainly be faster and simpler than the other encoding methods listed above (and because of that, I'd be a fool to just skip over it as a potential option). The expiration on the SO bounty was just a couple hours away so I had to award it, and John's answer provided the largest concentration of useful ideas for me. – relic Sep 17 '15 at 20:26
No worries. I'm not interested in that I just want to know if a simple escape works or not so I can refactor some of my own projects. – Tyler Durden Sep 17 '15 at 20:41
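The "escape once" point from this thread can be checked with a short Python sketch (the sample content is made up). The file content below already contains a JSON document and backslash escapes, yet one json.dumps on the way out and one json.loads on the way in are enough, whatever language the file is written in:

```python
import json

# A file whose text already contains JSON and escape sequences.
inner = json.dumps({"msg": 'He said "hi"\n'})
content = "data = " + inner + '\npath = "C:\\\\dir"\n'

# One escaping pass when sending...
wire = json.dumps({"content": content})

# ...and one parsing pass when receiving.
restored = json.loads(wire)["content"]

# The embedded JSON text is still just text; it was not parsed.
first_line = restored.split("\n")[0]
print(restored == content, type(restored).__name__)
```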
