0

I need to search for people whose FirstName is included (a substring of) in the FirstName of somebody else.

SELECT DISTINCT top 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]       
    FROM [dbo].[people] people
    INNER JOIN [dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
    WHERE EXISTS (SELECT * 
                        FROM [dbo].[people_NAME] peopleName2 
                        WHERE peopleName2.[Id] != people.[id] 
                            AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')

It is so slow! I know it's because of the "'%' + peopleName.[FirstName] + '%'", because if I replace it with a hardcoded value like '%G%', it runs instantly.

With my dynamic like, my top 10 takes mores that 10 seconds! I want to be able to run it on much bigger database.

What can I do?

Dale K
  • 25,246
  • 15
  • 42
  • 71
Shan9588
  • 50
  • 5
  • 2
    That's what happens when using `LIKE` with a wildcard on the left side... You can try and create an index on `people_NAME(FirstName)` but most probably this will not make a real performance difference. – GMB Oct 10 '19 at 00:20
  • this query is just the same to this one `SELECT DISTINCT top 10 p.[Id], pn.[LastName], pn.[FirstName] FROM [dbo].[people] p INNER JOIN [dbo].[people_NAME] pn on pn.[Id] = p.[Id]` – Ed Bangga Oct 10 '19 at 00:21
  • inner join and exist on same table? – Ed Bangga Oct 10 '19 at 00:22
  • I need to fin the people with name which are included in at least one other people's name. Exemple if I have Bob and Bobby, Bob will be a result. I tried adding some index but no difference :(! – Shan9588 Oct 10 '19 at 00:27
  • [Full Text Index](https://learn.microsoft.com/en-us/sql/t-sql/statements/create-fulltext-index-transact-sql?view=sql-server-2017)? – Dale K Oct 10 '19 at 00:29
  • I have not tried to create a full text index, would it make any big difference? I'm reluctant to create some index as this is a one thing statistic thing that I need to do. – Shan9588 Oct 10 '19 at 00:30
  • You need to investigate the situation, but thats the purpose of a full text index... finding random bits of text in a performant manner. – Dale K Oct 10 '19 at 00:46
  • 1
    Actually adding the hardcoded value isn't what is speeding it up, `like '%anything` (starts with %) will be slow. The reason your hardcoded value is fast is because it finds 10 matches early in the scan. – Dale K Oct 10 '19 at 01:52
  • It's a physical table – Shan9588 Oct 10 '19 at 14:58
  • which version of SQL server are you running? – MtwStark Oct 10 '19 at 15:28
  • Microsft SQL Server Management Studio 11.0.2100.60 – Shan9588 Oct 10 '19 at 16:03

4 Answers4

1

This is a hard problem. I don't think a full text index will help, because you want to compare two columns.

That doesn't leave good options. One possibility is to implement ngrams. These are sequences of characters (say, 3 in a row) that come from a string. From my first name, you would have:

gor
ord
rdo
don

Then you can use these for direct matching on another column. Then you have to do additional work to see if the full name for one column matches another. But the ngrams should significantly reduce the work space.

Also, implementing ngrams requires work. One method uses a trigger which calculates the ngrams for each name and then inserts them into an ngram table.

I'm not sure if all this work is worth the effort to solve your problem. But it is possible to speed up the search.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Take a look at my answer about using LIKE operator here

It could be quite performant if you use some tricks

You can gain much speed if you play with collation, try this:

SELECT DISTINCT TOP 10 p.[Id], n.[LastName], n.[FirstName]       
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n on n.[Id] = p.[Id]
WHERE EXISTS (
    SELECT 'x' x
    FROM [dbo].[people_NAME] n2
    WHERE n2.[Id] != p.[id]     
    AND 
        lower(n2.[FirstName]) collate latin1_general_bin 
        LIKE 
        '%' + lower(n1.[FirstName]) + '%' collate latin1_general_bin
)

As you can see we are using binary comparision instead of string comparision and this is much more performant.

Pay attention, you are working with people's names, so you can have issues with special unicode characters or strange accents.. etc.. etc..

Normally the EXISTS clause is better than INNER JOIN but you are using also a DISTINCT that is a GROUP BY on all columns.. so why not to use this?

You can switch to INNER JOIN and use the GROUP BY instead of the DISTINCT so testing COUNT(*)>1 will be (very little) more performant than testing WHERE n2.[Id] != p.[id], especially if your TOP clause is extracting many rows.

Try this:

SELECT TOP 10 p.[Id], n.[LastName], n.[FirstName]
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n on n.[Id] = p.[Id]
INNER JOIN [dbo].[people_NAME] n2 on 
    lower(n2.[FirstName]) collate latin1_general_bin 
    LIKE 
    '%' + lower(n1.[FirstName]) + '%' collate latin1_general_bin
GROUP BY n1.[Id], n1.[FirstName]
HAVING COUNT(*)>1

Here we are matching also the name itself, so we will find at least one match for each name. But We need only names that matches other names, so we will keep only rows with match count greater than one (count(*)=1 means that name match only with itself).

EDIT: I did all test using a random names table with 100000 rows and found that in this scenario, normal usage of LIKE operator is about three times worse than binary comparision.

MtwStark
  • 3,866
  • 1
  • 18
  • 32
  • Thanks for you answer, I will check this out. In the meanwhile, i'm closing this post. I have a lot of expérimentations to do :). – Shan9588 Oct 10 '19 at 18:56
0

Have you tried a JOIN instead of a correlated query ?.

Being unable to use an index it won't have an optimal performance, but it should be a bit better than a correlated subquery.

SELECT DISTINCT top 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]       
FROM [dbo].[people] people
     INNER JOIN [dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
     INNER JOIN [dbo].[people_NAME] peopleName2 on peopleName2.[Id] <> people.[id] AND
                                                   peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%'
Marc Guillot
  • 6,090
  • 1
  • 15
  • 42
0

You can do this,

With CTE as
(                       
    SELECT  top 10 peopleName.[Id], peopleName.[LastName], peopleName.[FirstName]       
    FROM 
    [dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
    WHERE EXISTS (SELECT 1 
                        FROM [dbo].[people_NAME] peopleName2 
                        WHERE peopleName2.[Id] != people.[id] 
                            AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')
    order by peopleName.[Id]                    
)

//here join CTE with people table if at all it is require

select * from CTE

IF joining with people is not require then no need of CTE.

KumarHarsh
  • 5,046
  • 1
  • 18
  • 22