
I'm crawling a website and gathering its data. I have several crawler machines that send data to a central server. The part of the crawler code that sends the data to the server is as follows:

requests.post(url, json=data, timeout=timeout, cookies=cookies, headers=headers)

On the central server side, which uses Django, I have the following code:

def save_users_data(request):
    body = json.loads(request.body)
    # do something with the received data

Sometimes the server receives incomplete data from the crawlers, so the json module cannot parse it and raises an error. For example, the server received the following data in request.body:

b'{"social_network": "some network", "text": "\\u0646\\u06cc\\u0633 \\u0628\\u0627\\u06cc\\u062f \\u0622\\u062a\\u06cc\\u0634 \\u062f\\u0631\\u0633\\u062a \\u06a9\\u0631\\u062f\\u0628\\u0631\\u06af\\u0634\\u062a\\u'

and it raised the following error:

json.decoder.JSONDecodeError: Invalid \uXXXX escape

Where is the problem?

EDIT

These are some lines from the nginx error.log file:

2018/07/25 12:54:39 [info] 29199#29199: *2520751 client 45.55.4.47 closed keepalive connection
2018/07/25 12:54:39 [info] 29199#29199: *2520753 client 188.166.71.114 closed keepalive connection
2018/07/25 12:55:35 [info] 29199#29199: *2520755 client 45.55.4.47 closed keepalive connection
2018/07/25 12:55:58 [info] 29199#29199: *2520757 client 45.55.4.47 closed keepalive connection
2018/07/25 12:55:59 [info] 29199#29199: *2520759 client 45.55.197.140 closed keepalive connection
2018/07/25 12:56:03 [info] 29199#29199: *2520761 client 188.166.71.114 closed keepalive connection
2018/07/25 12:56:04 [info] 29197#29197: *2520715 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 167.99.189.246, server: 91.208.165.33, request: "POST /crawler/save/users-data/ HTTP/1.1", upstream: "http://unix:/home/social/centralsystem/centralsystem.sock:/crawler/save/users-data/", host: "91.208.165.33"
2018/07/25 12:56:11 [info] 29197#29197: *2520723 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 159.89.20.103, server: 91.208.165.33, request: "POST /crawler/save/users-data/ HTTP/1.1", upstream: "http://unix:/home/social/centralsystem/centralsystem.sock:/crawler/save/users-data/", host: "91.208.165.33"
2018/07/25 12:56:12 [info] 29197#29197: *2520724 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 209.97.142.45, server: 91.208.165.33, request: "POST /crawler/save/users-data/ HTTP/1.1", upstream: "http://unix:/home/social/centralsystem/centralsystem.sock:/crawler/save/users-data/", host: "91.208.165.33"
2018/07/25 12:56:16 [info] 29199#29199: *2520765 client 67.207.92.190 closed keepalive connection
2018/07/25 12:56:17 [info] 29197#29197: *2520729 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 188.226.178.98, server: 91.208.165.33, request: "POST /crawler/save/users-data/ HTTP/1.1", upstream: "http://unix:/home/social/centralsystem/centralsystem.sock:/crawler/save/users-data/", host: "91.208.165.33"
2018/07/25 12:56:22 [info] 29199#29199: *2520770 client 188.166.71.114 closed keepalive connection
2018/07/25 12:56:26 [info] 29199#29199: *2520767 client 159.89.20.103 closed keepalive connection
2018/07/25 12:56:27 [info] 29197#29197: *2520777 client 159.89.20.103 closed keepalive connection
2018/07/25 12:56:28 [info] 29199#29199: *2520773 client 188.226.178.98 closed keepalive connection
2018/07/25 12:56:28 [info] 29197#29197: *2520779 client 45.55.197.140 closed keepalive connection
2018/07/25 12:56:29 [info] 29197#29197: *2520782 client 188.226.178.98 closed keepalive connection
2018/07/25 12:56:30 [info] 29199#29199: *2520768 client 209.97.142.45 closed keepalive connection
2018/07/25 12:56:30 [info] 29197#29197: *2520781 client 67.207.92.190 closed keepalive connection
2018/07/25 12:56:31 [info] 29197#29197: *2520786 client 209.97.142.45 closed keepalive connection
2018/07/25 12:56:36 [info] 29199#29199: *2520775 client 67.207.92.190 closed keepalive connection
mohammad
  • did you try `request.json()` to get json response? – JPG Jul 25 '18 at 08:00
  • I have not tried this but I think my http body is not complete. Am I wrong? – mohammad Jul 25 '18 at 08:03
  • @JerinPeterGeorge `request.json()` parses the data from request.body itself so it won't help. OP can you check if your server is sending the data correctly? like logging what is being sent – Arpit Solanki Jul 25 '18 at 08:03
  • @ArpitSolanki How can I log what is sent finally on the network? I know there is no problem in my data – mohammad Jul 25 '18 at 08:05
  • @mohammad I am asking that you log what is sent from your crawlers and check what is received by your central server. That way you will know whether something happened over the network – Arpit Solanki Jul 25 '18 at 08:08
  • This problem is mostly related to the server, not the client. Maybe your server closes the connection before it receives the whole content. Are you using nginx as a reverse proxy server? – Sraw Jul 25 '18 at 08:16
  • @Sraw I'm using nginx + gunicorn – mohammad Jul 25 '18 at 08:16
  • I guess there should be something useful in your nginx access log or error log. Have a check? For example, mismatched content-length. – Sraw Jul 25 '18 at 08:20
  • @Sraw I think you are right, and I have put the last lines of nginx error.log above. Do you know what the problem is? – mohammad Jul 25 '18 at 08:32
  • Try to add `proxy_ignore_client_abort on;` to your nginx config. – Max Jul 25 '18 at 08:33
  • @JonhyBeebop but my client should wait for the server response. – mohammad Jul 25 '18 at 08:34
  • `so upstream connection is closed too` as you can see. This should be the problem. Upstream connection closed before it should. The reason seems to be that your clients really close connections. Maybe your `timeout` is too short? – Sraw Jul 25 '18 at 08:38
  • @Sraw the timeout is 180 seconds, which is actually not low, but the task may take more time. So should I use a bigger timeout? – mohammad Jul 25 '18 at 08:40
  • Actually, I don't know :) I don't know what task you are doing or why you need more than 180s to handle a single request, so it is hard for others to really solve this problem. Maybe a message-queue-based structure is better. For example, `django` just adds tasks to a queue, and another thread/process processes the tasks. I guess your clients don't need the response returned from the server, right? – Sraw Jul 25 '18 at 08:45
  • My task is writing lots of data to database. My database is really large and writing takes a long time. My clients need the response – mohammad Jul 25 '18 at 08:49

2 Answers


Edit

Can you try running it with json.dumps instead of json.loads?

On Python 3 versions before 3.6, json.loads only accepts a unicode string, so decoding might be necessary:

body_unicode = request.body.decode('utf-8')
body = json.loads(body_unicode)
content = body['content']

Read it here: Trying to parse `request.body` from POST in Django
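
For the truncated-body case in the question, it may also help to fail gracefully and log the raw payload instead of letting the view crash. This is only a minimal sketch of that idea, not the questioner's actual view; it assumes the standard logging module and Django's HttpResponse / HttpResponseBadRequest, and reuses the view name save_users_data from the question:

import json
import logging

from django.http import HttpResponse, HttpResponseBadRequest

logger = logging.getLogger(__name__)

def save_users_data(request):
    try:
        body = json.loads(request.body.decode('utf-8'))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Log the tail of the raw body so truncated requests can be inspected later.
        logger.warning("Unparsable request body (last 200 bytes): %r", request.body[-200:])
        return HttpResponseBadRequest("invalid JSON")
    # do something with the parsed data
    return HttpResponse("ok")

A 400 response at least makes the failure visible to the crawler instead of surfacing as a 500 on the server.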

ItayBenHaim

As mentioned in the comments, the problem was that my requests.post timeout was too low for the server's response time, so the clients closed their connections before the server responded.
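
A minimal crawler-side sketch of that fix, assuming the requests library from the question; the 600-second timeout is only an illustrative value, and url, data, cookies, and headers are the same variables used in the original snippet:

import requests

def post_users_data(url, data, cookies=None, headers=None, timeout=600):
    # Wait long enough for the slow database write on the server side.
    try:
        response = requests.post(url, json=data, timeout=timeout,
                                 cookies=cookies, headers=headers)
        response.raise_for_status()
        return response
    except requests.exceptions.RequestException:
        # Timeout, connection error, or HTTP error: let the caller retry later.
        return None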

mohammad