I have an odd situation. We have an AS400 with a Prod side and a Dev side (same machine, 2 nic cards) From a production SQL Server, we run a query from a MS-SQL server that is using a linked Server, I'll call 'as400' The query does 3 joins, and the execution plan looks roughly like [Remote Query] => Got Results. It does the joins on the remote server (the Production AS400) This will execute in no more than 0:00:02 (2 seconds) One of the joined tables has 391 MILLION rows. It is pulling 100 rows - joined to the other table.
Now, it gets weird. On the Dev side of the same machine, running on a different SQL Server, coming in the other NIC card, executing the same query with a different database (the dev one) the execution plan is quite different! It is:
[Query 1] hash match (inner join) with [Query2] Hash with [Query3] Hash with [Query4]
Expecting that each query returns 10,000 rows (I'm sure it is just using that as a guess as it doesn't know the actual rows as it is remote). What it appears to be doing is pulling 391 million rows back on query2 and it takes > 23 HOURS before I give up and kill the query. (Running in SSMS)
Clearly, the MS SQL Server is making the decision to not pass this off to the AS400 as a single query. Why?
I can 'solve' the problem by using a OpenQuery (as400, cmd) instead, but then it will open us up to SQL Injection, can't do simple syntax checking on the query, and other things I don't like. It takes 6 seconds to do the query using OpenQuery, and returns the correct data.
If we solve this by rewriting all our (working, fast) queries that we use in production so they can also run against dev - it involves a LOT of effort and there is down-side to it in actual production.
I tried using the 'remote' hint on the join, but that isn't supported by the AS400 :-(
Tried looking at the configuration/versions of the SQL Servers and that didn't seem to offer a clue either. (SQL Servers are nearly the same version/are same, 10.50.6000 for the one that works, and 10.50.6220 for one that fails (newer), and also 10.50.6000 for the other one that is failing.)
Any clues anyone? Would like to figure this out, we have had several people looking at this for a couple of weeks - including the Database Architect and the IBM AS400 guru, and me. (So far, my OpenQuery is the only thing that has worked)
One other point, the MS Servers seem to be opening connections 5 per second to the AS400 from the machines that are not working (while the query runs for 23 hours) - I don't understand that, and I'm not 100% sure it is related to this issue, but it was brought up by the AS400 guy.