
I'm trying to modify this query, which uses Levenshtein distance. My problem is that I want to apply my condition in an inner join rather than in the WHERE clause.

This is my current query:

SELECT     
    registrosImportados.id, TVC.NOMBRETVC, 
    VIACU.NOMBREVIACU, CUT.GLOSA, 
    registrosImportados.direccion,
    dbo.fn_LevenshteinDistance(registrosImportados.direccion, VIAIU.NOMBREVIAIU, 5)
FROM         
    ACOPIOCI 
INNER JOIN
    TVC ON ACOPIOCI.IDTVC = TVC.IDTVC 
INNER JOIN
    VIACU ON ACOPIOCI.IDVIACU = VIACU.IDVIACU 
INNER JOIN
    VIAIU ON ACOPIOCI.IDVIAIU = VIAIU.IDVIAIU 
INNER JOIN
    CUT ON ACOPIOCI.IDCUT = CUT.IDCUT 
INNER JOIN
    registrosImportados ON CUT.CUT = registrosImportados.cut
WHERE
    dbo.fn_LevenshteinDistance(registrosImportados.direccion, VIAIU.NOMBREVIAIU, 5) = (SELECT MIN(M.DISTANCIA)
                     FROM 
                         (SELECT dbo.fn_LevenshteinDistance(registrosImportados.direccion, VIAIU.NOMBREVIAIU, 5) AS DISTANCIA
                          FROM ACOPIOCI 
                          INNER JOIN VIAIU ON ACOPIOCI.IDVIAIU = VIAIU.IDVIAIU 
                          WHERE ACOPIOCI.IDCUT = CUT.IDCUT) AS M
    )
    AND ACOPIOCI.IDCUT = CUT.IDCUT

And this is what I want to do:

SELECT      
    VIACU.NOMBREVIACU, registrosImportados.DIRECCION,
    dbo.fn_LevenshteinDistance(registrosImportados.direccion,  VIAIU.NOMBREVIAIU, 2) DIST
FROM         
    ACOPIOCI 
INNER JOIN
    TVC ON ACOPIOCI.IDTVC = TVC.IDTVC 
INNER JOIN
    VIACU ON ACOPIOCI.IDVIACU = VIACU.IDVIACU 
INNER JOIN
    VIAIU ON ACOPIOCI.IDVIAIU = VIAIU.IDVIAIU 
INNER JOIN
    CUT ON ACOPIOCI.IDCUT = CUT.IDCUT 
INNER JOIN
    registrosImportados ON CUT.CUT = registrosImportados.cut 
INNER JOIN
    (SELECT  
        MIN(dbo.fn_LevenshteinDistance(registrosImportados.direccion, VIAIU.NOMBREVIAIU, 2)
    ) AS MINDIST) M ON registrosImportados.DIST = M.MINDIST

But it doesn't work. My goal is to improve the query's performance.

marc_s
Patricio
    Show us the code for `dbo.fn_LevenshteinDistance(..)`. This query will not perform well because a scalar function is a poor format for such a function in SQL Server. – RBarryYoung Apr 21 '15 at 18:04
    What does "dont work" mean? – Aaron Bertrand Apr 21 '15 at 18:11
  • The value of the last parameter in the call to fn_LevenshteinDistance is different in the first query than in the second. Without knowing what this function does, it's hard for us to say whether that matters. Also, using a function in the SELECT clause of a query is always bad for performance, since the database has to evaluate the function for every record of the result set. If the function is just a few lines of code, then you can gain spectacular performance by eliminating the function and putting that code in a subquery instead. – GuidoG Apr 21 '15 at 18:44

1 Answer


The best solution to this problem is to store the Levenshtein distance results in a table, possibly as a computed column so that updates are automatic, and to put an index on that value. Until you do this, you're going to have performance issues.
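As a sketch of what that could look like (the table and column names below are hypothetical, so adapt them to your schema): a standard computed column can only reference columns of its own table, and the distance here spans two tables, so a separate, populated and indexed distance table is the likelier shape. It would need to be refreshed whenever either side changes.

```sql
-- Hypothetical precomputed-distance table; all names are illustrative.
CREATE TABLE dbo.DistanciaCalculada
(
    registroId INT NOT NULL,   -- refers to registrosImportados.id
    idViaiu    INT NOT NULL,   -- refers to VIAIU.IDVIAIU
    distancia  INT NOT NULL,
    CONSTRAINT PK_DistanciaCalculada PRIMARY KEY (registroId, idViaiu)
);

-- Populate using the same joins as the query, so only the pairs the
-- query actually compares are computed (not a full cross product):
INSERT INTO dbo.DistanciaCalculada (registroId, idViaiu, distancia)
SELECT DISTINCT r.id, i.IDVIAIU,
       dbo.fn_LevenshteinDistance(r.direccion, i.NOMBREVIAIU, 2)
FROM ACOPIOCI a
INNER JOIN VIAIU i ON a.IDVIAIU = i.IDVIAIU
INNER JOIN CUT ON a.IDCUT = CUT.IDCUT
INNER JOIN registrosImportados r ON CUT.CUT = r.cut;

-- The index is what makes min-distance lookups cheap afterwards:
CREATE INDEX IX_DistanciaCalculada_distancia
    ON dbo.DistanciaCalculada (distancia);
```

With this in place, the main query can join to `dbo.DistanciaCalculada` instead of calling the UDF at all.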

That said, we can at least make some improvements to the query. The distance calculation currently appears three times in your query: in the select list, in the WHERE clause, and in the minimum-distance subquery. I expect SQL Server is smart enough to reuse the same calculation for both the SELECT and WHERE uses (you should verify this by checking the execution plan), but I doubt it can do so for the minimum-distance calculation.
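One quick way to do that check (a sketch; this assumes you're running the query interactively, e.g. in SSMS) is to turn on runtime statistics and compare variants, alongside inspecting the actual plan:

```sql
-- Surface CPU/elapsed time and logical reads so you can compare query
-- variants. In SSMS you can also enable "Include Actual Execution Plan"
-- (Ctrl+M) before running, then look at how often the scalar UDF
-- appears in the resulting plan.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- ... run the query being measured here ...

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```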

This means the query is effectively executed twice: once for the base data, and once to compute the minimum distance; almost all of the work of the base-data query has to be duplicated. We can make the query faster by pulling the unfiltered results of the main part of the query (including the Levenshtein distance calculation) into a temp table or CTE (the best option depends mainly on the size of the data) and then selecting from that table/CTE. This lets SQL Server avoid computing the distance twice for every candidate row.

For this example, I'll use the CTE option:

With Data As 
(
   SELECT     
      r.id, b.NOMBRETVC, c.NOMBREVIACU,
      CUT.GLOSA, r.direccion,
      dbo.fn_LevenshteinDistance(r.direccion,i.NOMBREVIAIU,2) As Distance
   FROM ACOPIOCI a
   INNER JOIN TVC b ON a.IDTVC = b.IDTVC 
   INNER JOIN VIACU c ON a.IDVIACU = c.IDVIACU 
   INNER JOIN VIAIU i ON a.IDVIAIU = i.IDVIAIU 
   INNER JOIN CUT ON a.IDCUT = CUT.IDCUT 
   INNER JOIN registrosImportados r ON CUT.CUT = r.cut
)
SELECT id, NOMBRETVC, NOMBREVIACU, GLOSA, direccion, Distance
FROM Data
WHERE Distance = (SELECT MIN(Distance) FROM Data)

If you have a lot of rows in the CTE (more than you have free memory), you'll do better using a temp table or table variable instead.
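That variant might look like this (a sketch using the same joins and aliases as the CTE above; the index is optional, but cheap once the data is materialized):

```sql
-- Materialize the joined rows and the distance once, into a temp table:
SELECT
    r.id, b.NOMBRETVC, c.NOMBREVIACU, CUT.GLOSA, r.direccion,
    dbo.fn_LevenshteinDistance(r.direccion, i.NOMBREVIAIU, 2) AS Distance
INTO #Data
FROM ACOPIOCI a
INNER JOIN TVC b ON a.IDTVC = b.IDTVC
INNER JOIN VIACU c ON a.IDVIACU = c.IDVIACU
INNER JOIN VIAIU i ON a.IDVIAIU = i.IDVIAIU
INNER JOIN CUT ON a.IDCUT = CUT.IDCUT
INNER JOIN registrosImportados r ON CUT.CUT = r.cut;

-- Indexing the stored distance makes both the MIN lookup and the
-- equality filter cheap:
CREATE INDEX IX_Data_Distance ON #Data (Distance);

SELECT id, NOMBRETVC, NOMBREVIACU, GLOSA, direccion, Distance
FROM #Data
WHERE Distance = (SELECT MIN(Distance) FROM #Data);

DROP TABLE #Data;
```

Either way, the expensive UDF runs exactly once per joined row instead of once per reference to it.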

Joel Coehoorn
  • Thank you very much! You're a master, @Joel Coehoorn! – Patricio Apr 21 '15 at 19:46
  • I'd love to know before and after query times for this. – Joel Coehoorn Apr 21 '15 at 20:28
  • With 100,000 records: before, it took 2 hours; now, with your code, 10 minutes. It's amazing! I'm not using temp tables, is that OK? – Patricio Apr 21 '15 at 21:18
  • If you can switch to storing the levenshtein distance in a computed column, you could likely make this query finish instantly. Forget 10 minutes. – Joel Coehoorn Apr 21 '15 at 21:52
  • Thank you very much for your time. What I'm doing here (or trying to do) is standardizing street addresses. I have a list of wrong addresses; for example, "11 Phln Av Sn Frncsco, CAA". I extract the street name and compare it against a list of known street addresses (900,000 records). If (my street name) "Phln" = (known database) "Phln", I can look up the associated correct name "Phelan" in another table of correct addresses, and so standardize the address. When there is no exact match, I try to use the Levenshtein distance to standardize. – Patricio Apr 22 '15 at 14:17
  • I would like to use something like this: http://www.analysisandsolutions.com/software/addr/ — they use dictionaries. Do you think that is better? The problem is that my address dictionary would be very long: at least 900,000 records. – Patricio Apr 22 '15 at 14:19
  • Dictionaries are made for large collections, but I'm not at all familiar with that product, so I can't say how it would fit in. – Joel Coehoorn Apr 22 '15 at 15:23