4

Relevant Technology: SQL Server 2008 R2 RAID 5 (4 disk) Windows Server 2008

To preface, our RAID 5 array had a disk partially fail. No failure was detected, but the drive light flashed amber periodically (solid amber indicates drive failure) after an unplanned power outage over the weekend and failure of the UPS. The outage was on a Saturday and I found the light Tuesday after noticing a "PAGEIOLATCH_SH" error and reading the post What is PAGEIOLATCH_SH wait type in SQL Server? (among others). We've replaced the drive and let it rebuild, but I'm still seeing the error.

The query is against a large table through a view which has several indexes on the underlying tables. I've rebuilt the indexes, re-saved the view in hopes of a better execution path, and simplified the query. Nothing has fixed the problem. The query has run without issue since 2006 and had no issues with the upgrade to SQL Server 2008 or to R2, both of which were applied when they first became available.

Originally the execution plan showed a fairly even distribution, but now it shows the majority on the second item, "Sort (Distinct Sort)", with about 30% split among Index Seeks. The time used to be between 2 and 10 seconds but now is upwards of 2 minutes.

At this point I'm not sure how to isolate what is causing the issue. I assume it's either corrupt data that I'm not finding or that the query has re-optimized itself to something that is far from optimal, or there's something wrong with the RAID that doesn't set off any lights or warnings.

I've done what PAGEIOLATCH_SH and similar issues usually require, and the indexes not only seem right, but have worked for years up to this point. I've also done everything I know to do to make sure the drive is working. My question is basically how do I go about diagnosing the root of the problem in this situation?

EDIT: Found that the server did not actually go down with the power outage, but the rack beside it did. Not sure why the drive partially failed, but at this point it seems to be coincidental to the outage.

Community
  • 1
  • 1
Merennulli
  • 73
  • 2
  • 10

1 Answers1

4

Do you see many small PAGEIOLATCH_SH waits, or few large ones?

select * from sys.dm_os_wait_stats
where wait_type = 'PAGEIOLATCH_SH';

What is the exact results (count, sum wait time, max wait time).

Many small waits would indicate a change in the query plan. Comparing (if possible) the number of logical reads of the query with the baseline number would corroborate this (increase in the number of logical reads). Also, comparing the plans, if possible, would help isolating the issue.

Few large waits would indicate indeed a drive problem (long wait for IO). Error 833 logged in the ERRORLOG would corroborate this (SQL Server has encountered ... occurrence(s) of I/O requests taking longer than ... seconds to complete).

Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
  • PAGEIOLATCH_SH 29275 301891 1350 890 – Merennulli Jun 01 '11 at 22:15
  • 1
    Avg. wait is ~10 ms, max wait was 1.3 sec. This is unlikely the explanation for a jump of execution time from 2 sec to 2 min. Plan change is much more probable. – Remus Rusanu Jun 01 '11 at 22:23
  • The numbers vary considerably depending on what queries run, many of which are automated processes. Unfortunately, I cannot shut off other queries to isolate this one. | I find no Error codes in ERRORLOG or its recent numbered friends except those related to failed logins around the last reboot and one about the OS at the time of the power failure. – Merennulli Jun 01 '11 at 22:31