1

I am looking for help on how to speed up the code bit below because as it stands, it is taking too long to perform the task. Any suggestions would be much appreciated. Thanks in advance!

The code bit below is a stripped down version of the actual version but all the important guts should be there. The code works; however, the code is really slow on even a modest size dataset. Needless to say, the primary culprit is the second, nested recordset/SQL call. The LIKE operator is part of the slowdown but I'm more concerned about the nesting and I think the LIKE operator will be required in what we're trying to accomplish. I tried nesting the second SQL call into the first but I didn't see a clean way of doing so.

Platform: Classic ASP, VBScript, MS Access DB

' Go through all people in the table.
sql1 = "SELECT ID, FN, LN, Email FROM Table1"
Call rst1.Open(sql1, cnx, 0, 1)
While Not rst1.EOF
    id = rst1.Fields("ID").Value
    fn = rst1.Fields("FN").Value
    ln = rst1.Fields("LN").Value
    email = rst1.Fields("Email").Value
    If IsNull(email) Or IsEmpty(email) Then
        email = ""
    End If

    ' ----- Figure out if any other people in the table has a similar name or is using the same e-mail address.
    '       Capture both the ID of those other people as well as figure out the total number of possible duplicates.
    sql2 = "SELECT ID FROM Table1"
    sql2 = sql2 & " WHERE"
    sql2 = sql2 & "     ID <> " & id
    sql2 = sql2 & "     AND"
    sql2 = sql2 & "         ("
    sql2 = sql2 & "             FN & ' ' & LN LIKE '%" & Replace(fn & " " & ln, "'", "''") & "%'"
    If email <> "" Then
        sql2 = sql2 & "         OR"
        sql2 = sql2 & "         Email LIKE '%" & Replace(email, "'", "''") & "%'"
    End If
    sql2 = sql2 & "         )"
    Call rst2.Open(sql2, cnx, 0, 1)
    numDups = 0
    possibleDups = ""
    While Not rst2.EOF
        numDups = numDups + 1
        If possibleDups <> "" Then
            possibleDups = possibleDups & ", "
        End If
        possibleDups = possibleDups & rst2.Fields("ID").Value
        Call rst2.MoveNext()
    Wend
    Call rst2.Close()
    ' ----- End nest query.

    Call Response.Write(fn & " " & ln & " has " & numDups & " possible duplicates (" & possibleDups & ")")

    Call rst1.MoveNext()
Wend
Call rst1.Close()

Update 1:

Per request, here is a bit more info on the sample data and the expected output. Table1 is basically a table with the fields: id, fn, ln, email. id is an autogenerated ID representing the entry and fn/ln represent the first/last name, respectively, of the person's entry. Expected output is as coded, e.g.,...

John Doe has 3 possible duplicates (1342, 3652, 98325)
John Doe has 3 possible duplicates (986, 3652, 98325)
John Doe has 3 possible duplicates (986, 1342, 98325)
John Doe has 3 possible duplicates (986, 1342, 3652)
Sam Jones has 0 possible duplicates ()
Jane Smith has 2 possible duplicates (234, 10562)
Jane Smith has 2 possible duplicates (155, 10562)
Jane Smith has 2 possible duplicates (155, 234)

The numbers in parentheses correspond to the id's that appear to be duplicates to each person. A possible duplicate is a scenario in which another entry in the same table appears to share the same name or e-mail. For example, there could be 4 John Doe's and 3 Jane Smith's in the table based on name alone.

Ideally, only one SQL query is required to reduce the roundtrip induced by the recordset call but Access is limited compared to regular SQL Server as far as features and I'm not sure what I'm missing that might help speed this up.

Update 2:

Using the SQL Fiddle by @Abecee, I was able to get a faster query. However, I am now encountering two problems as a result.

The big picture view is still the same. We are looking for possible duplicates based on first name, last name, and e-mail address. However, we also added a search criteria, which are the lines wrapped inside of If searchstring <> "" Then ... End If. Also, note that the e-mail info is now being pulled from a separate table called EmailTable with the fields id, IndividualID (representing Table1.id), and email.

Mods: The updated query is similar but slightly different from the original query above. I'm not sure if it's better to create a whole new question or not, so I'll just leave this here for now. Let me know if I should move this to its own question.

If the code associated with comment A below is uncommented sql1 = sql1 & " OR (INSTR(E1.Email, E2.Email) > 0) ", I get an error message: Microsoft JET Database Engine (0x80040E14) Join expression not supported. The query seems to be coded correctly so what is missing or incorrect?

If the code associated with comment B below is uncommented sql1 = sql1 & " OR INSTR(E1.Email, '" & Replace(searchstring, "'", "''") & "') > 0", the query runs but it hangs. I tried dropping the query directly into Access to see if it'll work (e.g., New Query > SQL View) but it also hangs from within Access. I think the syntax and logic are correct but obviously something is askew. Do you see what or why it would hang with this line of code?

Here is the updated query:

sql1 = sql1 & "SELECT "
sql1 = sql1 & " T1.ID, T1.FN, T1.LN, E1.Email, "
sql1 = sql1 & " T2.ID, T2.FN, T2.LN "
sql1 = sql1 & "FROM "
sql1 = sql1 & "     ((Table1 T1 LEFT JOIN [SELECT E1.* FROM EmailTable E1 WHERE E1.Primary = True]. AS E1 ON T1.ID = E1.IndividualID)"
sql1 = sql1 & "         LEFT JOIN (Table1 T2 LEFT JOIN EmailTable E2 ON T2.ID = E2.IndividualID) "
sql1 = sql1 & "         ON "
sql1 = sql1 & "             ("
sql1 = sql1 & "                 T1.ID <> T2.ID "
sql1 = sql1 & "                 AND "
sql1 = sql1 & "                     ("
sql1 = sql1 & "                         ((INSTR(T1.FN, T2.FN) > 0) AND (INSTR(T1.LN, T2.LN) > 0)) "
' A. When the following line is uncommented, error is "Join expression not supported."
' sql1 = sql1 & "                       OR (INSTR(E1.Email, E2.Email) > 0) "
sql1 = sql1 & "                     ) "
sql1 = sql1 & "             ) "
sql1 = sql1 & "     ) "
If searchstring <> "" Then
    sql1 = sql1 & " WHERE "
    sql1 = sql1 & "     INSTR(T1.FN & ' ' & T1.LN, '" & Replace(searchstring, "'", "''") & "') > 0"
    ' B. When the following line is uncommented, code hangs on the rst1.open() call."
    ' sql1 = sql1 & "   OR INSTR(E1.Email, '" & Replace(searchstring, "'", "''") & "') > 0"
End If
sql1 = sql1 & " ORDER BY T1.LN, T1.FN, T1.ID"

prevID = 0

Call rst1.Open(sql1, cnx, 0, 1)
While Not rst1.EOF
    id = rst1.Fields("ID").Value

    ' Get initial values if we've come across a new ID.
    If (id <> prevID) Then
        fn = rst1.Fields("T1.FN").Value
        ln = rst1.Fields("T1.LN").Value
        email = rst1.Fields("Email").Value
        If IsNull(email) Or IsEmpty(email) Then
            email = ""
        End If

        ' Reset the counter for how many possible duplicates there are.
        numDups = 0

        ' If there is an ID from the second table, then keep track of this possible duplicate.
        tmp = rst1.Fields("T2.ID").Value
        If IsNumeric(tmp) Then
            tmp = CLng(tmp)
        Else
            tmp = 0
        End If
        If tmp > 0 Then
            numDups = numDups + 1
            possibleDups = possibleDups & tmp
        End If
    End If

    ' Figure out if we should show this row. Within this logic, we'll also see if there is another possible duplicate.
    showrow = False
    Call rst1.MoveNext()
    If rst1.EOF Then
        ' Already at the end of the recordset so show this row.
        showrow = True
        Call rst1.MovePrevious()
    Else
        If rst1.Fields("T1.ID") <> lngIndividualIDCurrent Then
            ' Next record is different T1, so show this row.
            showrow = True
            Call rst1.MovePrevious()
        Else
            ' Next record is the same T1, so don't show this row but note the duplicate.
            Call rst1.MovePrevious()
            ' Also, add the new T2 as a possible duplicate.
            tmp = rst1.Fields("T2.ID").Value
            If IsNumeric(tmp) Then
                tmp = CLng(tmp)
            Else
                tmp = 0
            End If
            If tmp > 0 Then
                numDups = numDups + 1
                If possibleDups <> "" Then
                    possibleDups = possibleDups & ", "
                End If
                possibleDups = possibleDups & tmp
            End If
        End If
    End If

    If showrow Then
        Call Response.Write(fn & " " & ln & " has " & numDups & " possible duplicates (" & possibleDups & ")")
    End If

    Call rst1.MoveNext()
    prevID = id
Wend
Call rst1.Close()
Shadow The GPT Wizard
  • 66,030
  • 26
  • 140
  • 208
jiminy
  • 1,612
  • 3
  • 18
  • 21
  • 1
    This type of question may be better suited to [codereview.se] – crthompson Dec 09 '14 at 20:51
  • Could you possibly provide some sample data with expected output? – Abecee Dec 09 '14 at 21:57
  • 1
    No MS Access at hand to test. But you should get the idea from [SQL Fiddle](http://sqlfiddle.com/#!3/56334/1) Will need to replace `CHARINDEX` with `INSTR`. From there, http://stackoverflow.com/questions/5517233/ms-access-query-concatenating-rows-through-a-query should get you closer to what you are looking for. The way, it is currently set up, it is not symmetric (See especially 1 -> 2 but no 2 -> 1.), which could be addressed if actually required. – Abecee Dec 10 '14 at 01:42
  • 1
    check this too:[Improving String Handling Performance in ASP Applications](http://msdn.microsoft.com/en-us/library/ms972323.aspx) – Flakes Dec 10 '14 at 07:18
  • @SearchAndResQ Thanks for the tip but in my scenario, the string concatenation doesn't factor heavily since the strings being concatenated are relatively short. Just to be sure, I tested improving the string handling but there was no noticeable speed increase. – jiminy Dec 10 '14 at 23:32
  • @Abecee Thanks for the tip. That looks promising. I'll give it a shot and report back. – jiminy Dec 10 '14 at 23:34
  • @Abecee Your SQL Fiddle helped. Thank you again. Now I'm going to take it a step further and will update the original question with the dilemma. – jiminy Dec 11 '14 at 04:05
  • Best tip I can give, **don't do it**. Especially don't do this with `ADODB.Recordset`. At least use an array from calling `.GetRows()`. – user692942 Dec 15 '14 at 12:46
  • @Lankymart I tried the advice given by the other folks as well as your advice to use GetRows(). Unfortunately, the queries didn't speed up enough nor could I find what are causing the problems noted in the update #2. I don't think GetRows() sped things up much because the bottleneck appears to be at the DB. Needless to say, without moving to a better DB/system, there doesn't seem much more to do so I'm going to leave this at that. Nonetheless, if someone in the future sees anything else that would help, I'd still be interested in hearing the thoughts. – jiminy Dec 15 '14 at 20:20
  • The bottleneck when using Access is always going to be the DB it's not designed for web architecture. – user692942 Dec 15 '14 at 21:05
  • Are you still working on this? – Abecee Jan 11 '15 at 14:48
  • Not any more. We ended up just leaving it as is. We couldn't move the DB to SQL and we ended up limiting who could access the script so there the slowdowns would be limited to just a few users. I'd still like to know a work-around if possible, but it doesn't seem so. – jiminy Jan 12 '15 at 05:13
  • @jiminy In case, this is still at the back of your mind: Did you try to move the `OR (INSTR(E1.Email, E2.Email) > 0)` part from the `JOIN` condition to a/the `WHERE` clause? – Abecee Jul 13 '15 at 09:23

1 Answers1

3

Yes, that's going to be slow because LIKE '%whatever%' is not sargable. So, if [Table1] has 1,000 rows then at best you'll be retrieving the other 999 rows for each row in the table, which means that you'll be pulling 999,000 rows in total.

A few observations:

  1. You are performing the comparisons for every row in the table against every other row. That would be something that you might want to do one time only to find possible dups in legacy data, but as part of the normal operation of an application we would expect to compare one record against all of the others (i.e. the one record that you are inserting or updating).

  2. You are looking for rows WHERE 'fn1 ln1' LIKE('%fn2 ln2%'). How is that significantly different from WHERE fn1=fn2 AND ln1=ln2? That would be sargable, so if you had indexes on [FN] and [LN] then that could speed things up a great deal.

  3. You really should NOT be using an Access database as the back-end for a web application (ref: here).

Gord Thompson
  • 116,920
  • 32
  • 215
  • 418
  • Thanks for the input Gord! 1. Yes, this is a case where the code will only be used sparingly but when used, I'd like for this to be a little more pleasant to use time-wise. 2. That idea was to try to catch cases like Sean Jack, which may be similar to DeSean Jackson. Obviously, it doesn't catch all permutations but it does catch a common scenario for our users. I tried switching it over to your suggestion and I didn't notice enough of a speedup. 3. Agreed about Access, but no choice in this case. – jiminy Dec 09 '14 at 23:08
  • Indices were added to [FN] and [LN]. There was a noticeable speed increase but it still takes a while. – jiminy Dec 09 '14 at 23:18