
I've seen this question asked in many ways all over the Internet, but despite implementing the abundance of advice (and some voodoo), I'm still struggling. I have a 100 GB+ database that is constantly inserting and updating records in very large transactions (200+ statements per transaction). After a system restart, the performance is amazing (data is written to a large SATA III SSD connected via USB 3.0). The SQL Server instance is running on a VM under VMware Workstation. The host is set to hold the entire VM in memory, the VM itself has a paging cache of 5000 MB, and the SQL Server service account has 'lock pages in memory' enabled. I have 5 GB of RAM allocated to the VM, and the max memory of the SQL Server instance is set to half a gig.

I have played with every single one of these parameters to attempt to maintain consistent performance, but sure and steady, the performance eventually degrades to the point where it begins to time out. Here's the kicker though, if I stop the application that's loading the database, and then execute the stored proc in the Management Studio, it runs like lightning, clearly indicating it's not an issue with the query, and probably nothing to do with memory management or paging. If I then restart the loader app, it still crawls. If I reboot the VM however, the app once again runs like lightning...for a while...

Does anybody have any other suggestions based upon the symptoms presented?

Sean
    Check indices. Likely someone has missed the basics. Like 95% of the time people complain about SQL performance. – TomTom Dec 05 '14 at 14:23
  • Lots of comments regarding the SQL or schema itself, but as evidenced by the restored performance after restarting the server, the SQL is not the issue. – Sean Dec 05 '14 at 14:51
  • 1
    For my experience, "half a Gig" is very low memory for SQL Server. SQL server version? I suggest you to collect informations with activity monitor to see the suffering resource. – user_0 Dec 05 '14 at 16:18
  • I tried many different amounts of memory, both SQL Server and VM RAM allocation, to no avail. SQL Server version 2008, latest service pack – Sean Dec 05 '14 at 17:16

5 Answers

  • Depending on how large your hot set is, 5 GB of memory may be taxed by a 100+ GB database.

  • Check indices and query plans. We cannot help you without them. And I bet you are missing some indices, which is the standard performance issue people have.

  • Otherwise, once you've done your homework, head over to dba.stackexchange.com and ask there.

  • Generally, consider that 200 statements per transaction may simply indicate seriously sub-optimal programming. For example, you could bulk-load the data into a temp table and then merge it into the final one.
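The last suggestion can be sketched in T-SQL. This is only an illustration, not the asker's actual schema: `dbo.Target`, `#Staging`, and their columns are invented names, and the staging table would normally be filled by `BULK INSERT` or `SqlBulkCopy` rather than the placeholder comment below.

```sql
-- Hypothetical sketch: stage the whole batch, then apply it with one
-- set-based MERGE instead of 200+ individual INSERT/UPDATE statements.
-- Table and column names are invented for illustration.
CREATE TABLE #Staging (Id int PRIMARY KEY, Payload varchar(200));

-- ...populate #Staging here (BULK INSERT, SqlBulkCopy, etc.)...

BEGIN TRANSACTION;

MERGE dbo.Target AS t
USING #Staging AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Payload = s.Payload
WHEN NOT MATCHED THEN
    INSERT (Id, Payload) VALUES (s.Id, s.Payload);

COMMIT TRANSACTION;

DROP TABLE #Staging;
```

The point of the pattern is that the interconnected rows still commit atomically (the single MERGE is one transaction), but the server evaluates them as one set-based operation instead of hundreds of round trips.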

TomTom
  • 5 GB is actually abundant considering most of my data is new and caching will be of little benefit, regardless of how much memory I allocate to SQL Server (I've proven this: performance is unaffected no matter what my memory/cache sizes are). Also, the data for these 200+ statements is implicitly interconnected, so it would still need to be part of a transaction during your suggested merge, even if pre-loaded into temp tables (an extra step that really doesn't change the nature of the final inserts/updates). – Sean Dec 05 '14 at 16:00
  • Glad to know I'm on the right path per your last bullet. Could it also be that the query is being performed during work hours and the number of connections is higher? – interesting-name-here Feb 13 '17 at 14:48
  • I had a query that was crawling and then it completed much faster than I expected after 5pm – interesting-name-here Feb 13 '17 at 14:49

Actually, I may have a working theory. I added some logic to the app so that when it times out, it sits for two minutes and then tries again, and voila! Back to full speed. I rubber-ducked my co-worker and came up with the theory that my perceived SSD write speeds were actually the write speed to the VMware host's virtual USB 3 buffer, and that the actual SSD write speeds were slower. I'm probably hitting the host's buffer size limit, and by forcing the app to wait two minutes, the host has a chance to dump its back-buffered data to the SSD. Elementary, Watson :)

If this approach also fails to be sustainable, I'll report in.
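For what it's worth, the pause-and-retry idea looks something like this as a T-SQL sketch. The real logic lives client-side in the loader app (a client command timeout never reaches a server-side CATCH), and `dbo.LoadBatch` is a hypothetical stand-in for the loader's work, so treat this purely as an illustration of the shape of the workaround:

```sql
-- Hypothetical sketch of the retry-after-pause workaround.
DECLARE @attempts int = 0;

WHILE @attempts < 3
BEGIN
    BEGIN TRY
        EXEC dbo.LoadBatch;   -- invented proc standing in for the loader's batch
        BREAK;                -- success: stop retrying
    END TRY
    BEGIN CATCH
        SET @attempts = @attempts + 1;
        WAITFOR DELAY '00:02:00';  -- give the host two minutes to flush its USB buffer
    END CATCH
END
```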

Sean
  • If that's the case, then you should still be able to see the problem somewhere. What does the VMWare host think the `await` from `iostat` for that disk is? What does the SQL Server think its disk queue length is? Also, why are you writing through USB 3 instead of SATA? Or is the disk actually physically connected with USB 3.0? Are your transaction logs on the same volume, or connected to the same USB controller? If that's the case, I'm not at all surprised you have problems. USB isn't meant to handle the load from a SQL Server with a 100 GB db. Even with SSD, I'd expect RAID 10. – Bacon Bits Dec 05 '14 at 16:04
  • The host is a tablet. I don't have iostat for windows and perfmon won't let me add counters. This is just my personal development configuration and I wasn't expecting these types of issues to cascade down from the host, lol. The SSD is connected via USB 3 to the host and all the database, logs, etc, reside on the virtual machine's virtual disk on the physical SSD. I don't necessarily agree with your posit that USB 3 can't handle this scenario because the issue probably would also manifest in your RAID 10. – Sean Dec 05 '14 at 16:34
  • My data throughput is higher than most storage hardware's bandwidth, but because of the virtualization, the hardware bottlenecking and subsequent pacing that would normally occur, is only occurring at the host. Meanwhile, the VM is led to believe that the storage has ample bandwidth so it doesn't perceive a bottleneck and continues to dump data into the host's USB 3 buffer until the host says, stop, I'm full! and then I get timed out. I've seen this behavior before in other unrelated scenarios where the guest is dumping huge data to slow host-virtualized storage. – Sean Dec 05 '14 at 16:35
  • I believe VMWare is flawed in its USB virtualization implementation because it introduces hard-stopping behavior that is not exhibited by native USB, hence the confusing behavior of SQL Server in a VM when handling very high data throughputs. This is of course very theoretical, but at this time, it's been over a couple hours and my app is still screaming along nicely, whereas I'd be lucky to get ten minutes of good speed before. – Sean Dec 05 '14 at 16:35
  • So, you're on consumer-grade mobile hardware, running a VM (in Workstation, I assume) with both SQL Server and, presumably, the loader app running, using a single externally attached disk as storage? There are so many places where you could run into problems putting a load on that. It could be as simple as the USB 3 controller in the tablet isn't connected to the system bus on a full link because the tablet doesn't need that kind of bus speed, or the USB controllers or drivers or disk itself don't support UAS, etc. Do you have a test environment with better hardware? – Bacon Bits Dec 05 '14 at 17:42
  • It's my opinion that running this configuration on consumer grade hardware will expose ceilings that would be much higher (but still present) on more robust hardware. Since I've implemented this workaround (a pause that allows the VM host to flush its USB 3 cache), it's been flying along famously with only the expected intermittent hiccups that are being accommodated. I'm much more confident of its pending production deployment now that I can get it to work on my toys so well. I would classify this issue as a nascent race condition, and I think that makes sense to code for. – Sean Dec 05 '14 at 18:24

Try executing this to determine your problem queries:

SELECT TOP 20
    qs.sql_handle,
    qs.execution_count,
    qs.total_worker_time AS Total_CPU,
    total_CPU_inSeconds =              -- converted from microseconds
        qs.total_worker_time / 1000000,
    average_CPU_inSeconds =            -- converted from microseconds
        (qs.total_worker_time / 1000000) / qs.execution_count,
    qs.total_elapsed_time,
    total_elapsed_time_inSeconds =     -- converted from microseconds
        qs.total_elapsed_time / 1000000,
    st.text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_worker_time DESC

Then check your estimated and actual execution plans on the queries this command helps you pinpoint.

Source: "How do I find out what is hammering my SQL Server?" and the bottom of http://technet.microsoft.com/en-us/magazine/2007.11.sqlquery.aspx

user8128167

Beyond the excellent indexing suggestions already given, be sure to read up on parameter sniffing. That could be the cause of the problem.

SQL Server - parameter sniffing

http://www.sommarskog.se/query-plan-mysteries.html#compileps

As a result, you could have a bad query plan being re-used, or SQL Server's buffer could be getting full and forcing pages out of memory to disk (maybe that's other allocated memory in your case).
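If sniffing turns out to be the culprit, a per-statement RECOMPILE hint is one quick way to test the theory. The table and parameter names below are invented for illustration; the hint would go on whichever statement inside the stored proc is sensitive to its parameter values:

```sql
-- Hypothetical sketch: force a fresh plan for a sniffing-sensitive statement
-- so it is compiled for the current parameter value, not a stale cached one.
SELECT Id, Payload
FROM dbo.LoadQueue            -- invented table name
WHERE BatchId = @BatchId      -- invented parameter
OPTION (RECOMPILE);
```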

You could run DBCC FREEPROCCACHE and DBCC FREESYSTEMCACHE to empty them and see if you get a performance boost.
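For reference, the exact commands are below. Note that clearing these caches affects the whole instance (every plan recompiles on its next run), so this is a diagnostic step for a dev box, not a production fix:

```sql
-- Flush all cached query plans, then all entries from all system caches.
DBCC FREEPROCCACHE;
DBCC FREESYSTEMCACHE ('ALL');
```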

You should give SQL Server more memory too, as much as you can while leaving room for other critical programs and the OS. You might have 5 GB of RAM on the VM, but SQL Server only gets to play with half a GB, which seems REALLY small for what you're describing.
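Raising the cap is done through sp_configure. The 3584 MB figure below is just an illustrative value that would leave roughly 1.5 GB for the OS on a 5 GB VM; pick whatever fits your actual headroom:

```sql
-- Raise SQL Server's memory cap (value is in MB; 3584 is illustrative).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 3584;
RECONFIGURE;
```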

If those things don't move you in the right direction, install the SQL Server Management Data Warehouse so you can see exactly what is happening when your slowdown begins. Running it takes up additional memory, but it will give the DBAs more to go on.

Jared_S

In the end, what I did was a combination of two things: putting in logic to recover when timeouts occurred, and setting the VM's core count to reflect only physical cores, not logical cores. For example, the host has 2 hyper-threaded cores; when I set my VM to use 4 cores, it occasionally got hung in some infinite loop, but when I set it to 2 cores, it ran without fail. Still, aberrant behavior like this is difficult to mitigate reliably.

Sean