
I am doing my bachelor's thesis, for which I wrote a program that is distributed over many servers and exchanges messages via IPv6 multicast and unicast. The network usage is relatively high, but I don't think it is excessive: my test uses 15 servers, and there are 2 requests every second that work like this:

Server 1 requests information from servers 3-15 via multicast. Every one of servers 3-15 must respond. If a response is missing after 0.5 seconds, the multicast is resent, but only the missing servers have to respond (so in most cases this is only one server). Server 2 does exactly the same. If results are still missing after 5 retries, the missing servers are marked as dead and the change is synced with the other management server (1 or 2).
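For clarity, here is a minimal sketch of that retry loop as described above; the `send_multicast` and `collect_responses` helpers are placeholders for the real socket code, not my actual implementation:

```python
MAX_RETRIES = 5
TIMEOUT = 0.5  # seconds to wait for responses before resending

def query_servers(expected, send_multicast, collect_responses):
    """Ask all `expected` servers via multicast, retrying only the missing ones.

    `send_multicast(targets)` and `collect_responses(timeout)` are hypothetical
    helpers standing in for the real multicast/unicast socket code.
    """
    missing = set(expected)
    for attempt in range(MAX_RETRIES):
        send_multicast(missing)            # only the still-missing servers answer
        responded = collect_responses(TIMEOUT)
        missing -= responded
        if not missing:
            return set()                   # everyone answered
    return missing                         # mark these as dead and sync the change
```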

So there are 2 multicasts and 26 unicasts every second. I don't think that should be too much, should it?

Servers 1 and 2 run Python web servers, which I use to trigger the request every second on each server (via a web client).

The whole scenario runs in a Mininet environment inside a VirtualBox Ubuntu VM that has 2 cores (max 2.8 GHz) and 1 GB RAM. While running the test, I can see via htop that the CPUs are at 100% while the RAM is at 50%, so the CPU is the bottleneck here.

I noticed that after 2-5 minutes (1 minute = 60 * (2+26) messages = 1,680 messages) there are too many missing results, causing too many resends while new requests are already coming in, so the "management server" thinks the client servers (3-15) are down and deregisters them. After syncing this with the other management server, all client servers are marked as dead on both management servers, which is not true...

I am wondering if the problem could be my debug output? I print 3-5 messages for every message that is sent and received. So that is roughly (assuming 5 lines per sent/received message) (26 + 2) * 5 = 140 lines printed to the console every second.

I use Python 2.6 for the servers.

So the question here is: can console output slow the whole system down so much that simple requests take more than 0.5 seconds to complete, 5 times in a row? The request processing in my test is simple; there are no complex calculations. Basically it is something like `return request_param in ["bla", "blaaaa", ...]` (a small list of 5 items).

If yes, how can I disable the output completely without having to comment out every print statement? Or is there even a way to output only lines that contain "Error" or "Warning"? (Not via grep, because by the time grep becomes active all the prints have already finished; I mean directly in Python.)
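One standard way to get this behaviour without touching every call site is Python's built-in `logging` module: replace the prints with logger calls once, then raise the level in a single place to drop everything below "Warning". A minimal sketch (the logger name and messages are just examples, not my real output):

```python
import logging

# Configure once at startup; level=WARNING drops all debug/info output.
logging.basicConfig(level=logging.WARNING,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("myscript")

log.debug("sent multicast to %s", "ff02::1")        # suppressed at WARNING level
log.info("got response from server %d", 7)          # suppressed at WARNING level
log.warning("missing response from server %d", 9)   # still printed
log.error("server %d marked as dead", 9)            # still printed

# To silence everything in one place without touching the calls:
# logging.disable(logging.CRITICAL)
```

Unlike print, the formatting of a suppressed message is skipped entirely, so disabled debug lines cost very little.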

What else could cause my application to be that slow? I know this is a very generic question, but maybe someone already has experience with Mininet and network applications...

Simon Hessner
  • I forgot to mention: The whole output is sent to a file via bash: python myscript.py > log.txt – Simon Hessner Apr 05 '15 at 19:48
  • Formatted output is expensive, especially to the console. You could try `> /dev/null`. That should make some difference. The real thing to try is commenting out all the I/O. Regardless, [*try this*](http://stackoverflow.com/a/4299378/23771). It costs nothing and tells you just what's going on. – Mike Dunlavey Apr 05 '15 at 19:53
  • Yes, console output may be a bottleneck, and it depends not only on your code but also on the program that displays the text. Just benchmark in the Windows console and you'll see what I mean... Some Linux virtual terminals did a lot of work in that direction. Err... Ever heard of logging? – Cilyan Apr 05 '15 at 21:31
  • What exactly do you mean by logging? – Simon Hessner Apr 05 '15 at 21:59
  • I know what logging is in general, but in what context do you mean that? – Simon Hessner Apr 05 '15 at 22:02
  • Ok, I removed the debug output and the CPU usage is now a little lower. But even when there are no requests and 20 servers running, I have 50% usage on every core. So I guess the whole framework I use is doing too much stuff in the background and that this has to be optimized... – Simon Hessner Apr 06 '15 at 16:49
  • Don't look at CPU usage. Is it bad or good if it's down or up? It's just confusing you. Instead, look at the elapsed time to perform a certain amount of work. Then, if you want it to go faster, don't try to figure out what's *slow*. Try to figure out *what you can eliminate*. They aren't at all the same thing. – Mike Dunlavey Apr 08 '15 at 00:03

1 Answer


I finally found the real problem. It was not the prints (removing them improved performance a bit, but not significantly) but a thread that was using a shared lock. The lock was contended across multiple CPU cores, which made the whole thing very slow.

It even got slower the more cores I added to the executing VM, which was very strange...
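For anyone hitting the same thing: the pattern that bit me was holding the lock around far more work than the shared state actually needed. A sketch of the general fix (the names `state_lock`, `pending`, `expensive_parse` and `handle_response` are illustrative, not my real code):

```python
import threading

state_lock = threading.Lock()
pending = {}  # shared state: request id -> set of servers still missing

def expensive_parse(payload):
    return payload  # stand-in for whatever processing the response needs

def handle_response(req_id, server, payload):
    # Bad: holding the lock around the heavy work serializes every worker thread.
    #
    #   with state_lock:
    #       result = expensive_parse(payload)
    #       pending[req_id].discard(server)
    #
    # Better: do the heavy work outside and lock only the shared-state update.
    result = expensive_parse(payload)        # no lock needed here
    with state_lock:                         # short critical section
        pending[req_id].discard(server)
    return result
```

In CPython the GIL can cause a related effect for CPU-bound threads, where adding cores makes contention worse rather than better, which matches the symptom above.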

Now the new bottleneck seems to be the APScheduler... I always get messages like "event missed" because there is too much load on the scheduler. So that's the next thing to speed up... :)
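In case it helps someone else, APScheduler has settings for exactly this "missed job" situation. A sketch assuming APScheduler 3.x (the job function and intervals are placeholders, not my real setup):

```python
from apscheduler.schedulers.background import BackgroundScheduler

def poll():
    pass  # placeholder for the periodic request logic

sched = BackgroundScheduler()
# misfire_grace_time: still run a job that is up to N seconds late instead of
#   reporting it as missed; coalesce: collapse a backlog of missed runs into one;
# max_instances: don't stack up overlapping runs when one takes too long.
sched.add_job(poll, 'interval', seconds=1,
              misfire_grace_time=2, coalesce=True, max_instances=1)
sched.start()
```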

Simon Hessner