
I have two tables, T1 and T2. Each table has the same ten columns: teacher1, teacher2, teacher3, teacher4, teacher5, student1, student2, student3, student4, student5.

How can I get the rows where a row of T1 and a row of T2 meet the following two rules?

In the rules, student* means student1, student2, ...; teacher* means teacher1, teacher2, ...

  1. The T1 row and the T2 row are different, meaning at least one of the T1 row's values does not appear anywhere in the T2 row.
    (Column position does not matter here: a T1 teacher1 value may appear in any T2 student* or teacher* column.)
  2. At least one of the T1 row's teacher* values is in the T2 row's teacher* columns, OR at least one of the T1 row's student* values is in the T2 row's student* columns; that is, the two rows have at least one teacher or one student in common.

For example, if T1 has the single row

C2      NULL    NULL    NULL    NULL    S1      NULL    NULL    NULL    NULL

and T2 has the rows

NULL    NULL    NULL    S1      NULL    NULL    C2      NULL    NULL    NULL
S1      NULL    NULL    C3      NULL    C2      NULL    NULL    NULL    NULL
NULL    NULL    NULL    C2      NULL    S2      NULL    S3      NULL    NULL

then the expected result is only the third row of T2, because:

  • the first row meets rule 2 but not rule 1.
  • the second row meets rule 1 but not rule 2.
  • the third row meets both rule 1 and rule 2.

I wanted to use the CHARINDEX function, but each table has more than a million rows and the query is incredibly slow.
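Roughly, the idea was something like the following sketch (simplified, not my exact query; the real WHERE clause repeats the CHARINDEX test for all ten T1 columns and combines the results according to the two rules):

SELECT t1.*, t2.*
FROM T1 AS t1
CROSS JOIN T2 AS t2
-- concatenate T2's ten columns and probe the result with a T1 value;
-- every T2 row must be rescanned for every T1 row, hence the slowness
WHERE CHARINDEX(t1.teacher1,
          ISNULL(t2.teacher1, '') + '|' + ISNULL(t2.teacher2, '') + '|' +
          ISNULL(t2.teacher3, '') + '|' + ISNULL(t2.teacher4, '') + '|' +
          ISNULL(t2.teacher5, '') + '|' + ISNULL(t2.student1, '') + '|' +
          ISNULL(t2.student2, '') + '|' + ISNULL(t2.student3, '') + '|' +
          ISNULL(t2.student4, '') + '|' + ISNULL(t2.student5, '')) > 0
-- ...repeated for teacher2..teacher5 and student1..student5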

  • I have tried to understand your question, but it is confusing. Please review your T1 & T2 rules to see if you can make this clearer. Also, are you only looking for records that meet both T1 and T2? Finally, your comment re: charindex is confusing. If you have 5 separate columns for teachers and students, I don't see why you would want to use charindex based on your problem description. – Gary Walker Jul 15 '14 at 06:30
    That (Numbered columns that contain the same "sort" of data) sounds like a really poor table design. – Damien_The_Unbeliever Jul 15 '14 at 06:52
  • @Damien_The_Unbeliever you are right, it really is a poor design. But it already exists and there is nothing I can do about it. – Yinbin Wang Jul 15 '14 at 06:55
  • @GaryWalker I use charindex because I can add all of the columns in T2 together and compare each column against that. – Yinbin Wang Jul 15 '14 at 06:57
  • Do you mean you concatenate the columns together so you can do a single comparison? If so, it makes sense. – Gary Walker Jul 15 '14 at 06:59
  • Is there any Primary Key in your tables? Do you really want to check every row in T1 to every row in T2? – dnoeth Jul 15 '14 at 07:31
  • Please answer @dnoeth question -- a cartesian product will be incredibly huge and nothing will help -- I've been assuming you would not be trying to do that. – Gary Walker Jul 15 '14 at 07:40
  • @dnoeth yes, tables T1 and T2 each have their own primary key, but the keys are different. – Yinbin Wang Jul 15 '14 at 08:03

2 Answers


If you are stuck with this bad design, you don't have much in the way of useful options, but you might find the solution is to stop trying to do this in TSQL. You are pretty much forced into a full table scan (unless there is some WHERE-clause filtering you did not mention, though that same filtering would continue to help you avoid a full table scan).

TSQL has pretty bad performance for lots of string manipulation. Write your tests in C# or some other compiled language and they will run a great deal faster than equivalent TSQL code.

You can either write a CLR proc (SqlServer 2005+) or an old-style client-server application. In either case, you will basically suck in the entire table and do all the necessary comparisons in compiled code. Pulling all of the data into a client will be relatively slow, especially if you have a slow network connection between the client and the server.
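For the CLR route, the T-SQL side is just registration; the names below (MatchRules.dll, dbo.FindMatchingRows) are placeholders for whatever you build, not an existing library, and the actual row-comparison logic lives inside the compiled assembly:

-- hypothetical registration of a CLR proc; all names are placeholders
CREATE ASSEMBLY MatchRules
    FROM 'C:\clr\MatchRules.dll'
    WITH PERMISSION_SET = SAFE;
GO

CREATE PROCEDURE dbo.FindMatchingRows
AS EXTERNAL NAME MatchRules.StoredProcedures.FindMatchingRows;
GO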

You don't say what you intend to do with the records once you select them, or roughly how many you expect to select, but unless you end up selecting more than a few thousand rows, performance should not really drive the choice between CLR and a traditional client-server approach.

As a first approximation, just write a quick and dirty program that simply reads every record, and see if the processing time is more or less acceptable.

Do not keep all of the records in memory as you process them; i.e., process one record at a time, do your rules testing, and discard the record if it does not match your rules.

If you have to send large amounts of data back, doing it a record at a time will be very slow, which is why you try to avoid client (or CLR) based processing of large result sets.

Also, I am assuming you can keep the smaller table in memory and have a viable strategy to reduce the time spent comparing every row in T1 against every row in T2.

ADDED

The only other thing I can think of is altering your design, at least to the point where you create supplemental scratch tables that are properly normalized, and build them as needed for use in your query. It sounds like a lot of work, and it won't be fast either: writing those scratch tables means a lot of I/O and indexing, and then you still have to apply all the rules in TSQL, so more I/O, etc.
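A hedged sketch of what those scratch tables could look like: pk stands in for each table's real primary key, the temp table and index names are made up, and the rule logic follows the literal wording of the question (rule 2 as a same-role match, rule 1 as at least one T1 value missing from the whole T2 row). On the sample data this reading does return only the third T2 row. The VALUES row constructor requires SQL Server 2008 or later.

-- unpivot each wide row into (pk, role, person)
SELECT t.pk, v.role, v.person
INTO #T1_norm
FROM T1 AS t
CROSS APPLY (VALUES
    ('teacher', t.teacher1), ('teacher', t.teacher2), ('teacher', t.teacher3),
    ('teacher', t.teacher4), ('teacher', t.teacher5),
    ('student', t.student1), ('student', t.student2), ('student', t.student3),
    ('student', t.student4), ('student', t.student5)) AS v(role, person)
WHERE v.person IS NOT NULL;

SELECT t.pk, v.role, v.person
INTO #T2_norm
FROM T2 AS t
CROSS APPLY (VALUES
    ('teacher', t.teacher1), ('teacher', t.teacher2), ('teacher', t.teacher3),
    ('teacher', t.teacher4), ('teacher', t.teacher5),
    ('student', t.student1), ('student', t.student2), ('student', t.student3),
    ('student', t.student4), ('student', t.student5)) AS v(role, person)
WHERE v.person IS NOT NULL;

CREATE CLUSTERED INDEX ix_t1 ON #T1_norm (person, role, pk);
CREATE CLUSTERED INDEX ix_t2 ON #T2_norm (person, role, pk);

-- rule 2 becomes an equi-join; rule 1 becomes a NOT EXISTS probe
SELECT DISTINCT n1.pk AS t1_pk, n2.pk AS t2_pk
FROM #T1_norm AS n1
JOIN #T2_norm AS n2
  ON  n2.person = n1.person
  AND n2.role   = n1.role                      -- rule 2: same-role overlap
WHERE EXISTS (SELECT 1
              FROM #T1_norm AS x
              WHERE x.pk = n1.pk
                AND NOT EXISTS (SELECT 1
                                FROM #T2_norm AS y
                                WHERE y.pk = n2.pk
                                  AND y.person = x.person)); -- rule 1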

  • Thank you for your suggestions. What do you think if I convert rows to columns and save the results in a new table? – Yinbin Wang Jul 15 '14 at 07:27
  • Rephrase more carefully -- don't know what you are asking. – Gary Walker Jul 15 '14 at 07:30
  • I cannot describe this idea in English precisely. I want to create two new tables. table1 has the columns C2, NULL, S1, and has the data teacher1, teacher2, teacher3, teacher4, teacher5, student1, student2, student3, student4, student5. Like this question: (http://stackoverflow.com/questions/15745042/efficiently-convert-rows-to-columns-in-sql-server) – Yinbin Wang Jul 15 '14 at 08:14

Why don't you try adding checksum columns and matching them against your driver columns?

ALTER TABLE Table_Name
   ADD Checksum_Column_Name AS CHECKSUM(Column1, Column2, Column3,...);
GO 

The checksums will be equal if the data is the same. (The converse is not guaranteed: CHECKSUM can collide, so equal checksums should still be confirmed with a full comparison of the columns.)
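As a sketch of how the added column might be used (Checksum_Column_Name as defined above), a lookup could pre-filter on the checksum and then verify the real values:

-- hypothetical usage: pre-filter on the persisted checksum, then verify
SELECT t1.*, t2.*
FROM T1 AS t1
JOIN T2 AS t2
  ON t2.Checksum_Column_Name = t1.Checksum_Column_Name
WHERE t2.teacher1 = t1.teacher1   -- ...and so on for the other nine
  AND t2.student1 = t1.student1;  -- columns (NULL handling omitted)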

  • Won't help with the problem as described. Otherwise sound thinking in terms of trying to make a useful hash of the data. – Gary Walker Jul 15 '14 at 07:33