I have an API server built with Python Flask, and I need a group of clients (computers) to send data to it by making HTTP POST requests.

The data here is actually HTML content, usually about 200 KB per page. (NOTE: I am not converting other data into HTML/XML format; the data itself is HTML that I have collected from the web.) I am trying to reduce the network load as much as I can by using serialization/deserialization and compression.

Instead of sending raw HTML across the network, is there a way to serialize the HTML object (a BeautifulSoup soup?) and deserialize it on the server side? Or should I use some compression method to zip the content first and then POST it to the API server? The server can then decompress the data once it receives it.

What I have done:

(1) I tried turning the raw HTML text into a soup object and then serializing that with pickle. However, it complained about too many recursions and errored out. I also tried pickling the raw HTML string, but that gives essentially no compression: the result is almost the same size as the raw HTML string.

(2) I tried compressing the content with zlib beforehand, and the result is about 10% of the original size. However, is this a legitimate way to approach the problem?
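To make the comparison in (1) and (2) concrete, here is a minimal sketch using only the standard library. The `html` string is a hypothetical, highly repetitive stand-in for a scraped page, so real pages will usually compress less dramatically than this one.

```python
import pickle
import zlib

# Hypothetical stand-in for a scraped page: repetitive markup, roughly 190 KB.
row = '<tr><td class="cell">item</td><td class="cell">value</td></tr>\n'
html = "<html><body><table>\n" + row * 3000 + "</table></body></html>"

raw = html.encode("utf-8")
pickled = pickle.dumps(html)        # serialization only, no compression
compressed = zlib.compress(raw, 9)  # DEFLATE at the highest compression level

# Sizes in bytes: raw, pickled, compressed.
print(len(raw), len(pickled), len(compressed))

# The round trip is lossless.
assert zlib.decompress(compressed).decode("utf-8") == html
```

Pickle only wraps the string in a small envelope, which is why its output is no smaller than the input, while zlib exploits the repetition in the markup.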

Any thoughts?

B.Mr.W.
  • It depends on your data. HTML is probably not the most compact format. Maybe JSON will work better for you, or even CSV if the data looks like a table. Anyway, always use compression (`zlib` will reduce your data size). – Ivan Nevostruev Apr 16 '14 at 00:58
  • Since you mentioned deserializing on the other side, can we assume you are building both sides of the application? If so, you should consider another format anyway - perhaps JSON as already mentioned. – bsoist Apr 16 '14 at 01:02
  • @IvanNevostruev, sorry for the confusion. I am not deliberately putting the data into HTML format; the raw data is "HTML", meaning web pages. – B.Mr.W. Apr 16 '14 at 01:13
  • I see. In this case you might consider moving some logic to the client side to parse the HTML and simplify it by removing redundancy (for example, you don't need end tags in JSON). If you need the HTML as is, then there are not many options besides general compression algorithms like `zlib`. – Ivan Nevostruev Apr 16 '14 at 01:18
  • That's the approach I would take - move the responsibility for parsing the data to the client side. – bsoist Apr 16 '14 at 01:20
  • FWIW if `zlib` is really deflating these files 90% and it exists on both ends, you're probably not going to get much better than that. – Two-Bit Alchemist Apr 16 '14 at 01:20
  • `IO` is much, much, much slower than computation. +1 for `zlib`. – emesday Apr 16 '14 at 01:22
  • Thanks for the confirmation that `zlib` is the way to go. – B.Mr.W. Apr 16 '14 at 01:40
  • Since this is a community effort at this point, rather than anyone getting rep for it, I recommend OP post this as an answer and accept his own. – Two-Bit Alchemist Apr 16 '14 at 03:37
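
The suggestion in the comments to parse on the client and send only what is needed could look roughly like the sketch below. The choice of fields (page title and link targets) and the names `LinkExtractor` and `summarize` are assumptions made for illustration; the question does not say which parts of the page the server actually needs.

```python
import json
import zlib
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the page title and every href (a hypothetical choice of fields)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Reduce a page to a small dict before it leaves the client."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


page = ('<html><head><title> Demo </title></head><body>'
        '<a href="/a">A</a><a name="x">no href</a><a href="/b">B</a>'
        '</body></html>')
summary = summarize(page)
# The reduced data is still compressed before being sent.
payload = zlib.compress(json.dumps(summary).encode("utf-8"))
```

This only pays off when the server does not need the full page; if the HTML has to arrive unchanged, general-purpose compression remains the main option.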

1 Answer


Well, I got a lot of inspiration from your comments and came up with a solution: compress the HTML content using zlib and POST it to the API server. On the Flask API server side, I decompress the data and push it to MongoDB for storage.

Here is the part that might save someone a future headache.

Client Side:

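import zlib
import urllib.request

myinput = "http://www.exmaple.com/001"
myoutput = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" ... /html>'
result = {'myinput': myinput, 'myoutput': myoutput}
# Serialize the dict to text, encode it to bytes, then compress.
data = zlib.compress(repr(result).encode('utf-8'))
# Passing data makes this a POST; the path must match the server route.
urllib.request.urlopen("http://www.host.com/contribute", data)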

Server Side:

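import ast
import sys
import zlib

# app, request and db are assumed to be set up elsewhere (Flask app, pymongo database).
@app.route('/contribute', methods=['POST'])
def contribute():
    try:
        data = request.stream.read()
        text = zlib.decompress(data).decode('utf-8')
        # literal_eval parses literals only; eval would run client-sent code.
        result = ast.literal_eval(text)
        db.result.insert_one(result)
    except Exception:
        print(sys.exc_info())
    return 'OK'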
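
As a possible alternative to sending the dict as a Python literal and evaluating it on the server, JSON could serve as the wire format. The sketch below is not what the answer above uses. The helper names `encode_payload` and `decode_payload` and the `max_size` limit are hypothetical, and the size cap is an assumption about what a server accepting compressed uploads might want.

```python
import json
import zlib


def encode_payload(url, html):
    """Client side: dict -> JSON text -> UTF-8 bytes -> zlib-compressed bytes."""
    body = json.dumps({"myinput": url, "myoutput": html})
    return zlib.compress(body.encode("utf-8"))


def decode_payload(data, max_size=5_000_000):
    """Server side: decompress at most max_size bytes, then parse the JSON."""
    decompressor = zlib.decompressobj()
    raw = decompressor.decompress(data, max_size)
    # eof stays False if the stream was cut short or would exceed max_size.
    if not decompressor.eof:
        raise ValueError("payload too large or truncated")
    return json.loads(raw.decode("utf-8"))


blob = encode_payload("http://www.exmaple.com/001", '<p class="x">hi</p>')
document = decode_payload(blob)
```

In the Flask view, `decode_payload(request.stream.read())` would take the place of the decompress-and-evaluate step, and the resulting dict can be stored as before.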

Result in mongodb:

{ 
"_id" : ObjectId("534e0d346a1b7a0e48ff9076"), 
"myoutput" : "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" ... /html>",  
"myinput" : "http://www.exmaple.com/001" 
}

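(Note: As you may have noticed, the version shown by mongo has special characters such as double quotes escaped with a backslash in front of them. This is most likely just how the shell prints the string rather than a change to the stored value, but I am not sure how to display it without the escapes.)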

There were some discussions about retrieving binary data in Flask, like here. If you read from request.stream directly, you don't have to mess with the headers.

Thanks!

B.Mr.W.