
I have a Python script which goes off and makes a number of HTTP requests, via httplib and urllib, to various domains.

We have a huge number of domains to process and need to do this as quickly as possible. As HTTP requests are slow (i.e. they could time out if there is no website on the domain), I run a number of the scripts at any one time, feeding them from a domains list in the database.

The problem I see is that over a period of time (a few hours to 24 hours) the scripts all start to slow down, and ps -al shows they are sleeping.

The servers are very powerful (8 cores, 72GB RAM, 6TB RAID 6, etc., with an 80MB 2:1 connection) and are never maxed out, i.e. free -m shows

-/+ buffers/cache:      61157      11337
Swap:         4510        195       4315

top shows between 80% and 90% idle

sar -d shows average 5.3% util

and, more interestingly, iptraf starts off at around 50-60MB/s and ends up at 8-10MB/s after about 4 hours.

I am currently running around 500 instances of the script on each server (2 servers) and they both show the same problem.

ps -al shows that most of the python scripts are sleeping, which I don't understand. For instance:

0 S 0 28668  2987  0  80   0 - 71003 sk_wai pts/2 00:00:03 python
0 S 0 28669  2987  0  80   0 - 71619 inet_s pts/2 00:00:31 python
0 S 0 28670  2987  0  80   0 - 70947 sk_wai pts/2 00:00:07 python
0 S 0 28671  2987  0  80   0 - 71609 poll_s pts/2 00:00:29 python
0 S 0 28672  2987  0  80   0 - 71944 poll_s pts/2 00:00:31 python
0 S 0 28673  2987  0  80   0 - 71606 poll_s pts/2 00:00:26 python
0 S 0 28674  2987  0  80   0 - 71425 poll_s pts/2 00:00:20 python
0 S 0 28675  2987  0  80   0 - 70964 sk_wai pts/2 00:00:01 python
0 S 0 28676  2987  0  80   0 - 71205 inet_s pts/2 00:00:19 python
0 S 0 28677  2987  0  80   0 - 71610 inet_s pts/2 00:00:21 python
0 S 0 28678  2987  0  80   0 - 71491 inet_s pts/2 00:00:22 python

There is no sleep call in the script that gets executed, so I can't understand why ps -al shows most of them asleep, and why they should get slower and slower, making fewer requests over time, when CPU, memory, disk access and bandwidth are all available in abundance.

If anyone could help I would be very grateful.

EDIT:

The code is massive as I am using exceptions throughout it to capture diagnostics about the domain, i.e. reasons I can't connect. I will post the code somewhere if needed, but the fundamental calls via httplib and urllib are straight off the Python examples.
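For illustration only (the actual script isn't posted; check_domain and its return values are made up), the kind of per-domain request loop described above looks roughly like this - the point being that, without a timeout, urlopen() can block for a very long time on a dead host:

import socket
import urllib2
from urllib2 import HTTPError, URLError

def check_domain(domain):
    """Fetch a domain's front page and record why it failed, if it did."""
    url = "http://%s/" % domain
    try:
        response = urllib2.urlopen(url)      # blocks until data arrives or a timeout fires
        try:
            body = response.read()
            return "ok", len(body)
        finally:
            response.close()                 # always release the socket
    except HTTPError as e:
        return "http_error", e.code          # the site answered, but with 4xx/5xx
    except URLError as e:
        return "unreachable", str(e.reason)  # DNS failure, connection refused, etc.
    except socket.timeout:
        return "timeout", None               # only raised if a timeout has been set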

More info:

Both quota -u mysql and quota -u root come back with nothing.

ulimit -n comes back with 1024. I have changed limits.conf to allow the mysql user 16000 soft and hard connections and am able to run over 2,000 scripts so far, but still the same problem.

SOME PROGRESS

OK, so I have changed all the limits for the user and ensured all sockets are closed (they were not), and although things are better, I am still getting a slowdown, although not as bad.

Interestingly, I have also noticed some memory leak - the scripts use more and more memory the longer they run, but I am not sure what is causing this. I store output data in a string and then print it to the terminal after every iteration, and I do clear the string at the end too, but could the ever-increasing memory be down to the terminal storing all the output?

Edit: No, it seems not - I ran up 30 scripts without outputting to the terminal and still got the same leak. I'm not using anything clever (just strings, httplib and urllib) - I wonder if there are any issues with the Python MySQL connector...?

dan360
  • It would probably help if you provide some code. How do you do the requests exactly? – Muhammad Alkarouri Oct 01 '11 at 09:34
  • Are you sure the problem you are facing is not related to your internet connection getting worse upstream? – 6502 Oct 01 '11 at 10:12
  • It shouldn't be - the connection is pretty solid and is 80MB 2:1 both ways. If I kick off, say, 500 scripts the connection will sit at around 50MB/s for an hour or so and then reduce to 10MB/s over the space of a few hours. If I then kick off another, say, 100, it will increase to 40-50MB/s again and then slow over a similar time period. None of the scripts stop - they just seem to go to sleep as per the ps -al output above. – dan360 Oct 01 '11 at 10:16
  • lsof is also a good command to try. If there are 1024 open files then you have reached your ulimit and you would expect the processes to be sleeping. You could try raising the ulimit and see whether performance stays high for longer. – extraneon Oct 01 '11 at 10:35
  • 2
    You can use less (~10) number of processes to make concurrent requests if you use some async. framework such as twisted, gevent. Here's [gevent example](http://stackoverflow.com/questions/4783735/problem-with-multi-threaded-python-app-and-socket-connections/4850200#4850200), [twisted example](http://stackoverflow.com/questions/4783735/problem-with-multi-threaded-python-app-and-socket-connections/4868866#4868866). – jfs Oct 01 '11 at 17:14

4 Answers


Check the ulimit and quota for the box and the user running the scripts. /etc/security/limits.conf may also contain resource restrictions that you might want to modify.

ulimit -n will show the max number of open file descriptors allowed.

  • Might this have been exceeded with all of the open sockets?
  • Is the script closing each socket when it's done with it?

You can also check the fd's with ls -l /proc/[PID]/fd/ where [PID] is the process id of one of the scripts.
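Or, as a quick check from inside the script itself (Linux only; this snippet is an addition, not from the original scripts):

import os

# /proc/self/fd lists this process's open descriptors
# (the directory listing itself briefly adds one).
print "open file descriptors: %d" % len(os.listdir('/proc/self/fd'))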

Would need to see some code to tell what's really going on...


Edit (Importing comments and more troubleshooting ideas):

Can you show the code where you're opening and closing the connections?
When just a few script processes are running, do they also start to go idle after a while? Or does it only happen when there are several hundred or more running at once?
Is there a single parent process that starts all of these scripts?

If you're using s = urllib2.urlopen(someURL), make sure to s.close() when you're done with it. Python can often close things down for you (like if you're doing x = urllib2.urlopen(someURL).read()), but it will leave that to you if you hold on to the handle (such as when you assign the return value of .urlopen() to a variable). Double-check your opening and closing of the urllib calls (or all I/O code, to be safe). If each script is designed to have only one open socket at a time, and your /proc/PID/fd shows multiple active/open sockets per script process, then there is definitely a code issue to fix.

The ulimit -n value of 1024 is the limit on open sockets/file descriptors that the mysql user can have. You can change this with ulimit -S -n [LIMIT_#], but check out this article first:
Changing process.max-file-descriptor using 'ulimit -n' can cause MySQL to change the table_open_cache value.

You may need to log out and shell back in after. And/Or add it to /etc/bashrc (don't forget to source /etc/bashrc if you change bashrc and don't want to log out/in).

Disk space is another thing that I have found out (the hard way) can cause very weird issues. I have had processes act like they are running (not zombied) but not do what was expected because they had open handles to a log file on a partition with zero disk space left.

netstat -anpTee | grep -i mysql will also show if these sockets are connected/established/waiting to be closed/waiting on timeout/etc.

watch -n 0.1 'netstat -anpTee | grep -i mysql' to see the sockets open/close/change state/etc in real time in a nice table output (may need to export GREP_OPTIONS= first if you have it set to something like --color=always).

lsof -u mysql or lsof -U will also show you open FD's (the output is quite verbose).


import socket
import urllib2
from urllib2 import HTTPError, URLError

# Set a default timeout for every new socket (blocking mode is the default).
# With a timeout set, a recv() that finds no data, or a send() that can't
# immediately dispose of its data, raises an error instead of hanging forever.
# (settimeout(0) on an individual socket would make it fully non-blocking.)
socket.setdefaulttimeout(15)

#......

try:
    s = urllib2.urlopen(some_url)
    # do stuff with s like s.read(), s.headers, etc..
except (HTTPError, URLError, socket.timeout):   # catch whatever applies ("etcError")
    pass  # e.g. myLogger.exception("Error opening: %s!", some_url)
finally:
    try:
        s.close()
        # del s - although, I don't know if deleting s will help things any.
    except Exception:
        pass


chown
  • Sorry, I should have mentioned - I already went through this and allowed the mysql user 16384 (soft and hard) connections via PAM. Before I did this I could only start circa 1020 scripts, as would be expected - now I can open thousands. However, ulimit -n still shows 1024 - not sure if this makes a difference? – dan360 Oct 01 '11 at 09:55
  • I've also been googling fork bombs, but I can't see that 1000 processes should be an issue, especially as most of the time the scripts are just waiting on requested data, and with an 80MB (burst) connection I would have thought this would be fine. Will do some more investigation into quota - thanks. – dan360 Oct 01 '11 at 09:59
  • Both quota -u mysql and quota -u root come back with nothing. – dan360 Oct 01 '11 at 10:19
  • lrwx------ 1 root root 64 Oct 1 14:30 0 -> /dev/pts/2 lrwx------ 1 root root 64 Oct 1 14:30 1 -> /dev/pts/2 lrwx------ 1 root root 64 Oct 1 01:38 2 -> /dev/pts/2 lrwx------ 1 root root 64 Oct 1 14:30 3 -> socket:[275069545] lrwx------ 1 root root 64 Oct 1 14:30 4 -> socket:[313790164] lrwx------ 1 root root 64 Oct 1 14:30 6 -> socket:[313706399] – dan360 Oct 01 '11 at 11:33
  • lrwx------ 1 root root 64 Oct 1 14:30 0 -> /dev/pts/2 lrwx------ 1 root root 64 Oct 1 14:30 1 -> /dev/pts/2 lrwx------ 1 root root 64 Oct 1 01:38 2 -> /dev/pts/2 lrwx------ 1 root root 64 Oct 1 14:30 3 -> socket:[275069614] lrwx------ 1 root root 64 Oct 1 14:30 4 -> socket:[308695530] lrwx------ 1 root root 64 Oct 1 14:30 5 -> socket:[308708863] – dan360 Oct 01 '11 at 11:34
  • Pasted a few above - I think your earlier comment about not closing the connections might be it - in fact it looks very likely. (Am new to unix and very new to python!) – dan360 Oct 01 '11 at 11:36
  • Socket close is usually done with try/finally or perhaps using the 'with' statement. If you also set socket timeouts you should always free your sockets. – extraneon Oct 01 '11 at 14:08
  • Good point @extraneon, I've added a code example of how to close a socket down via a `try`/`finally` block. – chown Oct 01 '11 at 14:16
  • Many thanks for your input chown (and extraneon) it is really appreciated - will review code this evening and report back. – dan360 Oct 01 '11 at 14:47
  • Good luck @dan360. When you do get this resolved please come back and let us know what finally fixed it! – chown Oct 01 '11 at 23:38
  • Perhaps you meant IOError instead of HTTPError. – jfs Oct 02 '11 at 03:07
  • @J.F.Sebastian It's just an example; you could catch any exception at all within that block (or even a general `except:`), hence the `etcError`, use of undefined variables, and comments throughout... – chown Oct 02 '11 at 03:10
  • OK, so I have changed all the limits for the user and ensured all sockets are closed (they were not), and although things are better, I am still getting a slowdown, although not as bad. Interestingly, I have also noticed some memory leak - the scripts use more and more memory the longer they run, but I am not sure what is causing this. I store output data in a string and then print it to the terminal after every iteration, and I do clear the string at the end too, but could the ever-increasing memory be down to the terminal storing all the output? – dan360 Oct 03 '11 at 17:58
  • It's possible, but I don't know for sure whether terminal output can cause this. Try putting `del myString` when you're done with the string after each loop. Another method would be to write the output to a log file instead; then when you need to see the logs in real time you can `tail -f logfile.log`, using something like the logging module. Also, try flushing the stdout buffer stream every so often with `sys.stdout.flush()`. In my opinion, it would be better in general to use fewer single scripts and more threads per script, if that's a viable option for you. – chown Oct 03 '11 at 18:15
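A minimal sketch of the logging suggestion in the comment above (the filename and format here are assumptions, not from the original scripts):

import logging

# Send per-iteration output to a log file instead of the terminal,
# then follow it with `tail -f crawler.log`.
logging.basicConfig(
    filename='crawler.log',
    level=logging.INFO,
    format='%(asctime)s pid=%(process)d %(message)s',
)

# inside the per-domain loop, instead of printing:
logging.info("checked %s: %s", "example.com", "HTTP 200")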

Solved! - with massive help from Chown - thank you very much!

The slowdown was because I was not setting a socket timeout, and as such, over a period of time the robots were hanging trying to read data that did not exist. Adding a simple

import socket

timeout = 5
socket.setdefaulttimeout(timeout)

solved it (shame on me - but in my defence, I am still learning Python).
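(Side note, not part of the original fix: on Python 2.6 and later, urllib2.urlopen() also accepts a per-call timeout argument, so the process-wide default isn't strictly required.)

import urllib2

# Per-request timeout in seconds (Python 2.6+), instead of (or as well as)
# the global socket.setdefaulttimeout() used above.
response = urllib2.urlopen("http://example.com/", timeout=5)
try:
    data = response.read()
finally:
    response.close()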

The memory leak is down to urllib and the version of Python I am using. After a lot of googling, it appears it is a problem with nested urlopens - there are lots of posts online about it once you work out how to ask Google the right question.

Thanks all for your help.

EDIT:

Something that also helped with the memory leak issue (although it did not solve it completely) was doing manual garbage collection:

import gc
gc.collect()

Hope it helps someone else.

dan360

It is probably some system resource you're starved of. A guess: could you be hitting the limit on the pool of sockets your system can handle? If so, you might see improved performance if you can close the sockets faster/sooner.

EDIT: Depending on the effort you want to invest, you could restructure your application so that one process performs multiple requests. One socket can be reused from within the same process, as can a lot of other resources. Twisted lends itself very well to this type of programming.
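For example (a rough sketch, not actual Twisted code; the domain list, pool size and fetch function are placeholders), a single process can already multiplex many requests with a simple thread pool:

import socket
import urllib2
from multiprocessing.dummy import Pool   # thread pool, same process

socket.setdefaulttimeout(5)

def fetch(domain):
    try:
        r = urllib2.urlopen("http://%s/" % domain)
        try:
            return domain, len(r.read())
        finally:
            r.close()
    except Exception as e:
        return domain, "failed: %s" % e

domains = ["example.com", "example.org"]   # placeholder list
pool = Pool(50)                            # ~50 concurrent requests per process
results = pool.map(fetch, domains)
pool.close()
pool.join()

A gevent or Twisted version (see the links in the comments under the question) would typically scale to more concurrent requests per process than a thread pool.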

knitti

Another system resource to take into account is ephemeral ports: /proc/sys/net/ipv4/ip_local_port_range (on Linux). Together with /proc/sys/net/ipv4/tcp_fin_timeout, they limit the number of concurrent connections.

From Benchmark of Python WSGI Servers:

This basically enables the server to open LOTS of concurrent connections.

echo "10152 65535" > /proc/sys/net/ipv4/ip_local_port_range
sysctl -w fs.file-max=128000
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.core.somaxconn=250000
sysctl -w net.ipv4.tcp_max_syn_backlog=2500
sysctl -w net.core.netdev_max_backlog=2500
ulimit -n 10240
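As a rough check (an addition, assuming Linux; not from the quoted benchmark), the two values can be read straight out of /proc:

# The ephemeral range bounds how many outgoing connections can be open at
# once, and tcp_fin_timeout affects how long finished connections keep a
# port tied up before it can be reused.
with open('/proc/sys/net/ipv4/ip_local_port_range') as f:
    low, high = map(int, f.read().split())
with open('/proc/sys/net/ipv4/tcp_fin_timeout') as f:
    fin_timeout = int(f.read().strip())

print "ephemeral ports available: %d" % (high - low + 1)
print "tcp_fin_timeout: %d seconds" % fin_timeout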
jfs