
I'm using LoadRunner to stress-test a J2EE application.

I have: one MySQL DB server and one JBoss app server. Each is a 16-core (1.8 GHz) / 8 GB RAM box.

Connection pooling: the DB server has max_connections = 100 in my.cnf, and the app server likewise has min-pool-size and max-pool-size set to 100 in mysql-ds.xml and mysql-ro-ds.xml.
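For reference, the relevant settings look roughly like this (a sketch only; the JNDI name, URL, and database below are placeholders, and the element names follow the usual JBoss *-ds.xml layout):

    <!-- mysql-ds.xml (sketch; names and URL are placeholders) -->
    <datasources>
      <local-tx-datasource>
        <jndi-name>MySqlDS</jndi-name>
        <connection-url>jdbc:mysql://dbhost:3306/mydb</connection-url>
        <driver-class>com.mysql.jdbc.Driver</driver-class>
        <!-- pool pinned at 100 connections, matching max_connections on the DB -->
        <min-pool-size>100</min-pool-size>
        <max-pool-size>100</max-pool-size>
      </local-tx-datasource>
    </datasources>

(One thing worth double-checking: with two datasources each allowed up to 100 connections, the combined maximum of 200 would exceed the DB's max_connections = 100.)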

I'm simulating a load of 100 virtual users from a 'regular', single-core PC. This is a 1.8GHz / 1GB RAM box.

The application is deployed and being used on a 100 Mbps ethernet LAN.

I'm using rendezvous points in sections of my stress-testing script to simulate real-world parallel (as opposed to merely concurrent) use.
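Conceptually, a rendezvous point is just a barrier: each virtual user blocks until all of them have arrived, and then they are all released at once. Here is a minimal Java sketch of the same idea, purely illustrative since LoadRunner implements this internally:

    import java.util.concurrent.CyclicBarrier;

    public class RendezvousDemo {
        public static void main(String[] args) {
            int users = 100; // number of simulated virtual users (illustrative)
            // Every thread blocks in await() until the last one arrives;
            // then all are released together - the "rendezvous".
            CyclicBarrier rendezvous = new CyclicBarrier(users,
                    () -> System.out.println("All users released together"));

            for (int i = 0; i < users; i++) {
                new Thread(() -> {
                    try {
                        // ... per-user setup (login, navigation) goes here ...
                        rendezvous.await(); // wait for the other virtual users
                        // ... the timed, "simultaneous" request goes here ...
                    } catch (Exception e) {
                        Thread.currentThread().interrupt();
                    }
                }, "vuser").start();
            }
        }
    }

Note that even when the barrier releases all 100 threads "at once", a single core still executes them interleaved, which is exactly what my question 1 below is about.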

Question:

The CPU utilization on this load-generating PC never reaches 100%, and memory, too, appears to be available. So I could try adding more virtual users on this PC. But before I do that, I would like to understand one or two fundamentals about concurrency/parallelism and hardware:

  1. With only a single-core load generator like this one, can I really simulate a parallel load of 100 users (each of whom would be operating from a dedicated PC in real life)? My possibly incorrect understanding is that 100 threads on a single-core PC will run concurrently (that is, interleaved) but not in parallel... which means I cannot really simulate a real-world load of 100 parallel users (on 100 PCs) from just one single-core PC! Is that correct?

  2. Network bandwidth limitations on user parallelism: Even if I had a 100-core load-generating PC (or, alternatively, 100 single-core PCs sitting on my LAN), wouldn't the way Ethernet works permit only concurrency, and not parallelism, of users on the wire connecting the load-generating PC to the server? In fact, it seems this issue (the absence of user parallelism) would persist even in real-world application usage (with one PC per user), since user requests reaching the app server on a multi-core box can only arrive interleaved. That is, the only time the multi-core server could process user requests in parallel would be if each user had her own dedicated physical-layer connection to the server!

  3. Assuming parallelism is not achievable (due to the above 'issues') and only the next best thing, concurrency, is possible, how would I go about selecting the hardware and network specification for my simulation? For example: (a) How powerful should my load-generating PCs be? (b) How many virtual users should I create per PC? (c) Does each PC on the LAN have to be connected to the server via a switch (to avoid the broadcast traffic that would occur if a hub were used instead)?

Thanks in advance,

/HS

Harry

4 Answers


It sounds to me like you're overthinking this a bit. Your servers are fast and new, and are more than suited to handling lots of clients. Your bottleneck (if you have one) is going to be either your application itself or your 100 Mbps network.

1./2. You're testing the server, not the client. In this case, all the client is doing is sending and receiving data - there's no overhead for client processing (rendering HTML, decoding images, executing JavaScript, and whatever else it may be). A recent single-core machine can easily saturate a gigabit link; a 100 Mbps pipe should be cake.

Also - the processors on newer/fancier Ethernet cards offload a lot of work from the CPU, so you shouldn't necessarily expect a CPU hit.

3. Don't use a hub. There's a reason you can buy a 100 Mbps hub for $5 on Craigslist.

Seth
  • Seth, I do realize I'm testing the server and not the client. What I don't know is how to identify (and use) the peak capabilities of the client, the server, and the network without 'hanging' any component of the system. E.g., there is a context-switch overhead with OS threads (each virtual user being an OS thread)... so I was/am not sure if I could have 200 or 300 virtual-user threads without spending too much time context switching. Q: Is there a book/resource where such performance-tuning concepts are covered? Thanks for your response. – Harry Jan 28 '11 at 04:28

Without a better understanding of your application it's tough to answer some of this, but generally speaking you are correct: to achieve a "true" stress test of your server, it would be ideal to have 100 cores (given a target of 100 concurrent users), i.e., 100 PCs. Various practical issues, though, will probably make that point moot.

I have a communication engine I built a couple of years back (.NET / C#) that uses asynchronous sockets - we needed the fastest speeds possible, so we had to forgo any additional layers on top of the socket, like HTTP or other higher abstractions. Running on a quad-core 3.0 GHz computer with 4 GB of RAM, that server easily handles the traffic of ~2,200 concurrent connections. There's a Gb switch, and all the PCs have Gb NICs. Even with all PCs communicating at the same time, it's rare to see processor loads > 30% on that server. I assume this is because of all the latency that is inherent in the "total system."
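Since that engine is .NET/C#, here is only an analogue, but the shape of the design in Java terms is a single thread multiplexing many non-blocking sockets via a selector. A minimal sketch (port and buffer size are arbitrary; error handling omitted):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // One thread servicing many connections - the reason thousands of
    // concurrent sockets need not mean thousands of busy threads.
    public class AsyncEchoServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(9000));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocate(4096);
            while (true) {
                selector.select(); // block until some channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) { // new client connection
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) { // data from a client
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        if (client.read(buf) == -1) { client.close(); continue; }
                        buf.flip();
                        client.write(buf); // echo back (demo only)
                    }
                }
            }
        }
    }

The low CPU load makes sense under this model: the thread sleeps in select() whenever no socket has work pending.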

We have a new requirement to support 50,000 concurrent users, which I'm currently implementing. The server has dual quad-core 2.8 GHz processors, a 64-bit OS, and 12 GB of RAM. Our modeling shows this computer is more than enough to handle the 50K users.

Issues like the network latency I mentioned (don't forget the CAT 3 vs. CAT 5 vs. CAT 6 issue), database connections, the types of data being stored and mean record sizes, referential issues, backplane and bus speeds, hard drive speeds and sizes, etc., play as much of a role as anything in slowing down a platform "in total." My guess is that you could put 500, 750, 1,000, or even more users on your system.

The goal in the past was to never leave a thread blocked for too long ... the new goal is to keep all the cores busy.

I have another application that downloads and analyzes the content of ~7,800 URLs daily. Running on a dual quad-core 3.0 GHz machine (Windows 7 Ultimate, 64-bit) with 24 GB of RAM, that process used to take ~28 minutes to complete. By simply switching the loop to a Parallel.ForEach(), the entire process now takes < 5 minutes. The processor load we've seen is always less than 20%, with a maximum network load of only 14% (CAT 5 on a Gb NIC through a standard Gb dumb hub and a T-1 line).
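For the Java readers: the equivalent of that one-line Parallel.ForEach() change is switching a loop to a parallel stream. A sketch (fetchAndAnalyze is a hypothetical stand-in for the real per-URL work):

    import java.util.List;

    public class UrlBatch {
        public static void main(String[] args) {
            List<String> urls = List.of("http://example.com/a", "http://example.com/b");

            // Sequential: one URL at a time; cores sit idle during I/O waits.
            urls.forEach(UrlBatch::fetchAndAnalyze);

            // Parallel: the same work fanned out across the common
            // fork-join pool - the keep-all-cores-busy change described above.
            urls.parallelStream().forEach(UrlBatch::fetchAndAnalyze);
        }

        // Hypothetical stand-in for downloading and analyzing one URL.
        static void fetchAndAnalyze(String url) {
            System.out.println(Thread.currentThread().getName() + " -> " + url);
        }
    }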

Keeping all the cores busy makes a huge difference, especially on applications that spend a lot of time waiting on IO.

BonanzaDriver
  • Cirrus, appreciate your answer. Is there a book/resource where such performance-tuning concepts are covered? Any Linux tools I could use to stress, monitor, and tune my overall system? I don't know, for example, how to drive each component in the system (hardware, software, network) to its peak capacity. I have a very broad (and thus vague) acquaintance with the individual components of an enterprise-class system, but do not know of a systematic approach to measuring and tuning them. – Harry Jan 28 '11 at 04:32
  • Harry, I honestly do not know if such a "singular" reference exists. I've been building client-server systems for over 20 years (though I've specialized in mobile and mobile-to-backend integration for the past 11). Most of what I know has come through the "school of knocking one's head against the wall"... and I have a Ph.D. The performance improvements we've documented have come simply from attempting to use the fastest coding patterns (parallel, async-everything), then measure, adjust, test; measure, adjust, test... and at some point being happy with the result. Not very scientific. – BonanzaDriver Jan 28 '11 at 18:49

Not only are you using Ethernet; assuming you're writing web services, you're also talking HTTP(S), which sits atop TCP sockets - a reliable, ordered protocol with the built-in round trips inherent to reliable protocols. Sockets sit on top of IP; if your IP packets don't line up with your Ethernet frames, you'll never fully utilize your network. Even if you were using UDP, had shaped your datagrams to fit your Ethernet frames, and had 100 load generators and 100 1 Gbit Ethernet cards on your server, they'd still be operating on interrupts, and you'd have time multiplexing a little further down the stack.

Each level here can be thought of in terms of transactions, but it doesn't make sense to think at every level at once. If you're writing a SOAP application, you operate at layer 7 of the OSI model, and that is your domain. As far as you're concerned, your transactions are SOAP HTTP(S) requests; they are parallel and take varying amounts of time to complete.

Now, to actually get around to answering your question: it depends on your test scripts, the amount of memory they use, and even the speed at which your application responds. 200 or more virtual users should be okay, but finding your bottlenecks is a matter of scientific inquiry. Do the experiments, find the bottlenecks, widen them, and repeat until you're happy. Gather system metrics from your load generators and the system under test, compare them with OS provider recommendations, look at the difference between a dying system and a working one, look for graphs that reach a plateau, and so on.

Gareth Davidson
  • Gaz, I found your first 2 paras *really* insightful; your statement on 'interrupts' totally blew me over. Now, given what you said about the interrupts, it appears that true parallelism can never be achieved unless and until you have specialized hardware / OS / software... that can dedicate itself to processing a single 'task' given to it over the network, with the network being dedicated as well to a given user. It also appears that more parallelism may be achieved by scale-out h/w & s/w architectures than by scale-up. Is there a book/resource that covers identifying bottlenecks of each... – Harry Jan 29 '11 at 04:58
  • ... abstraction/layer (from the end-user to the CPU that does the actual job)... via some intelligent calculations that can help reduce the scope of an otherwise brute-force approach of testing and measuring a very large combinatorial set of variables. Thanks, much! – Harry Jan 29 '11 at 05:01
  • I'm afraid that the type of understanding you're looking for would mean reading about every level of your system, starting with the protocols you're using at the highest level and going down through hardware architectures and the proprietary software running on them. Is "true" parallelism even possible, or is the universe also time-multiplexed? Such philosophical questions aren't very useful to the task at hand; in the practical world, parallelism depends on your view of what constitutes a transaction... – Gareth Davidson Jan 29 '11 at 14:35
  • ... I guess a book on network theory would give you a better theoretical base, but that's still academic rather than practical knowledge. It's better to start at the top layer and specialise when you need a deeper view of what's going on; be curious. In your case you need to understand your Windows system counters (CPU, RAM, page faults, run queue, etc.), then your TCP/IP layer, your JVM, MySQL, and UNIX counters. Systems *always* have bottlenecks; finding them is a matter of science, and pre-empting them comes from experience... – Gareth Davidson Jan 29 '11 at 14:50
  • ... however, any decent software engineer will tell you that early optimization is evil; you risk spending 10% of your time writing "optimizations" that give a 1% speed-up. The only sensible way to do things is to profile, gather understanding, rework problem areas, and repeat until the costs outweigh the rewards. Load tests themselves are just another computer program; you should follow a similar methodology. Fully utilizing your kit may seem like a nice goal, but in the real world you just need to meet tangible requirements like "x concurrent requests, y per minute, 90% under z seconds", etc. – Gareth Davidson Jan 29 '11 at 15:07
  • Thanks. (I think, you meant to say, "... repeat until the *rewards* outweigh the *costs*.") – Harry Jan 31 '11 at 15:19

As you are representing users, disregard the rendezvous unless you have an engineering requirement to maintain simultaneous behavior, or unless your agents are processes rather than human users and are governed by a clock tick. Humans are chaotic computing units with variable arrival and departure windows, based upon how quickly one can or cannot read, type, converse with friends, etc. A great book on the subject of population behavior is "Chaos" by James Gleick.

The odds of your 100 decoupled users being highly synchronous in their behavior on an instantaneous basis under observable conditions are zero. The odds of concurrent activity within a defined time window, however, such as 100 users logging in within 10 minutes after 9:00 am on a business morning, can be quite high.
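If you want to model that kind of time-window behavior in a test harness rather than a rendezvous, one common approach (an assumption for illustration, not something prescribed here) is random inter-arrival gaps, e.g. exponentially distributed ones. A Java sketch:

    import java.util.Random;

    // Staggers 100 simulated user arrivals over roughly a 10-minute window
    // using exponential inter-arrival times (illustrative assumption).
    public class ArrivalModel {
        public static void main(String[] args) throws InterruptedException {
            Random rng = new Random();
            int users = 100;
            double meanGapMs = 6000; // ~100 arrivals spread over ~10 minutes

            for (int i = 0; i < users; i++) {
                // Exponential gap: -mean * ln(U), with U uniform in (0, 1].
                long gap = (long) (-meanGapMs * Math.log(1.0 - rng.nextDouble()));
                Thread.sleep(gap); // wait before the next user "arrives"
                int id = i;
                new Thread(() -> {
                    // ... this virtual user would log in and run its script ...
                    System.out.println("user " + id + " arrived");
                }).start();
            }
        }
    }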

As a side note, a resume with rendezvous emphasized on it is the #1 marker for a person with poor tool understanding and a poor performance-test process. This comes from a folio of over 1,500 interviews conducted over the past 15 years (I started as a Mercury employee on April 1, 1996).

James Pulley

Moderator

-SQAForums WinRunner, LoadRunner

-YahooGroups LoadRunner, Advanced-LoadRunner

-GoogleGroups lr-LoadRunner

-LinkedIn LoadRunner (owner), LoadrunnerByTheHour (owner)

Mercury Alum (1996-2000)

CTO, Newcoe Performance Engineering