
I'm using python + Django to handle incoming web requests that can post a large amount of JSON attached as one of the POST data fields (e.g. var1=abc&json_var=lots_of_data&other_var=xxx). I'd like to process the JSON in a streaming fashion with my own streaming json parser which takes a file-like handle as its input argument. It appears from https://docs.djangoproject.com/en/1.11/ref/request-response/ that this is feasible, using HttpRequest.__iter__(), but I can't find any examples of how to achieve this with my own code (i.e. not just importing a library like xml.etree.ElementTree).

Basically, I'd like to do the following:

POST request w/ big JSON => Django/python => create file-like handle to read POST => streaming url decoder => streaming JSON processor

I can use ijson for the streaming JSON processor. How do I fill in the two gaps for creating a file-like handle to the POST data and passing it to a streaming url decoder? Would prefer not to roll my own of either but I suppose if necessary I could.

mwag

2 Answers


I was only able to solve this by rolling my own generators and iterators. There were a few keys to solving this:

  • finding out how to access the file-like handle for the POST data when the data is sent chunked. It is available at request.META.get('wsgi.input'), which I located by using this post to dump all of the request attributes
  • roll my own generator to read a file-like handle and yield (varname, data_chunk) pairs
  • roll my own generator, based on a modified version of this post, to create a file-like handle that has the normal read() operations but has three additional features:
    • f.varname returns the name of the current variable that is being read
    • data is url-unencoded before it is passed back from read()
    • f.next_pair() advances the handle to the next variable. So, f.read() is called until the first variable is finished; then, if there is another, f.next_pair() will return true and f.read() can be called again until the next variable is done being read
  • further stream processing can be achieved in the main loop
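To illustrate the first of those generators, here is a minimal sketch of reading url-encoded form data from a file-like handle and yielding (varname, decoded_chunk) pairs. The name and the edge-case handling (notably holding back a %xx escape that straddles a chunk boundary) are illustrative, not necessarily what the production code does:

```python
from urllib.parse import unquote_plus

def qs_pairs_from_file(f, buffer_size=8192):
    """Sketch: read url-encoded form data (var1=abc&var2=...) from a
    file-like handle and yield (varname, decoded_chunk) pairs."""
    varname = None
    buf = b''
    while True:
        chunk = f.read(buffer_size)
        if not chunk:
            break
        buf += chunk
        while True:
            if varname is None:
                eq = buf.find(b'=')
                if eq < 0:
                    break  # variable name not complete yet; wait for more data
                varname = unquote_plus(buf[:eq].decode())
                buf = buf[eq + 1:]
            amp = buf.find(b'&')
            if amp >= 0:
                # end of this variable's value
                yield varname, unquote_plus(buf[:amp].decode())
                buf = buf[amp + 1:]
                varname = None
                continue
            # no separator yet: emit what is safe, but hold back a
            # possibly incomplete %xx escape at the end of the buffer
            pct = buf.rfind(b'%')
            safe = pct if (pct >= 0 and pct > len(buf) - 3) else len(buf)
            if safe:
                yield varname, unquote_plus(buf[:safe].decode())
                buf = buf[safe:]
            break
    if varname is not None:
        yield varname, unquote_plus(buf.decode())  # flush the final value
```

Consumers can concatenate the chunks per variable name, or hand them to a stream processor as they arrive.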

Putting it all together it looks something like:

f = request.META.get('wsgi.input')
ff = some_magic_adaptor(qs_from_file_to_generator(f))

while ff.next_pair():
    print('varname: ' + ff.varname)
    if ff.varname == 'stream_parse_this':
        parser = stream_parser(ff)
        for event_results in parser:
            do_something(event_results)

    while True:
        data = ff.read(buffer_size)
        if not data:
            break
        do_something_with_data_chunk(data)
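For completeness, the adaptor could be sketched along these lines: it wraps an iterator of (varname, chunk) pairs in the file-like interface described above. This is a simplified, hypothetical version; the class and method names are illustrative:

```python
class FieldStream:
    """Sketch: wrap an iterator of (varname, chunk) pairs so each
    variable can be read like a file via read(), varname, next_pair()."""
    def __init__(self, pairs):
        self._pairs = iter(pairs)
        self._buf = ''
        self._pending = None   # first chunk of the next variable, if seen
        self._done = False
        self.varname = None

    def _advance(self):
        # pull one pair from the underlying iterator
        try:
            name, chunk = next(self._pairs)
        except StopIteration:
            self._done = True
            return
        if name == self.varname:
            self._buf += chunk
        else:
            self._pending = (name, chunk)  # start of the next variable

    def next_pair(self):
        """Advance to the next variable; return False when exhausted."""
        while self._pending is None and not self._done:
            self._advance()
        if self._pending is None:
            return False
        self.varname, self._buf = self._pending
        self._pending = None
        return True

    def read(self, size=-1):
        """Read up to size decoded characters of the current variable;
        return '' once the variable is exhausted."""
        while (not self._done and self._pending is None
               and (size < 0 or len(self._buf) < size)):
            self._advance()
        if size < 0:
            size = len(self._buf)
        out, self._buf = self._buf[:size], self._buf[size:]
        return out
```

The read loop in the snippet above then works unchanged: read() returns '' at the end of each variable, and next_pair() moves on to the next one.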
mwag

__iter__() is just a wrapper around xreadlines(), which in turn is just a loop that reads and yields data a line at a time from the HttpRequest's input stream. So you can replace that sample code in the manual with something like this:

parser = MyJsonParser()
for line in request:
    parser.process(line)
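Since the iteration is line-based, you can see the shape of this with any file-like object; here MyJsonParser is just a stand-in that records the lines it is fed:

```python
import io

class MyJsonParser:
    """Stand-in parser that just records the lines it is fed."""
    def __init__(self):
        self.lines = []

    def process(self, line):
        self.lines.append(line)

# a plain file-like body standing in for the request
body = io.BytesIO(b'{"a": 1,\n "b": 2}\n')
parser = MyJsonParser()
for line in body:  # same shape as: for line in request
    parser.process(line)
# parser.lines is now [b'{"a": 1,\n', b' "b": 2}\n']
```

Note that the chunks you get are whatever the stream considers lines, which is why a single-line url-encoded body arrives all at once.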

You haven't posted code, so you will have to adapt as you see fit.

Also be aware that this may or may not be a real streaming process, depending on your setup. It's very likely that your server mechanism will read the entire POST body at once and pass it to your view without sending it a line at a time, in which case the appearance of streaming is illusory.

e4c5
  • Thank you. Assuming that the incoming request is encoded in chunked fashion, is there any way to ensure that the server ingests the data in chunks as well-- or at the least, limits the memory it allocates to ingesting the post data? I.e. if the incoming POST is 2GB in size, I don't want the server to put it all in memory. Ideally I can process the data as it comes in, or alternatively, the data is written to disk using fixed memory before I process it. – mwag Jun 03 '17 at 09:44
  • I think that comment should be a new question. Even as a question it would be far too broad. You would have to limit it to a specific server – e4c5 Jun 03 '17 at 09:49
  • Also if the POST data is url-encoded, then the first call to the iterator will return all data in a single line, even if it's e.g. 2GB (or maybe not if sent chunked?) Re "limit it to a specific server": I thought that the question implicitly has a server specified (Django's default web server)-- could you pls explain what add'l info would be needed? – mwag Jun 03 '17 at 10:01
  • Also, ```for line in request:``` did not yield anything when the POST was sent chunked. Re server details, I'm testing with cherrypy, but I was hoping that the requests interface would have abstracted that away – mwag Jun 03 '17 at 10:34
  • You definitely shouldn't be using the django development server here!!!! – e4c5 Jun 03 '17 at 10:34