0

I have two tables:

table "Person"

ID          FirstName  LastName
----------- ---------- ----------
1           Janez      Novak
2           Matija     Špacapan
3           Francka    Joras

Table "UserList"

ID    FullName
----- --------------------
1     Andrej Novak
2     Novak Peter Janez
3     Jana Novak
4     Andrej Kosir
5     Jan Balon
6     Francka Joras
7     France Joras

As a result, the query must return those IDs from both tables, that FirstName and Lastname from table Person exist in table UserList. The name and Lastname must be precisely the same. FullName in table UserList can include the middle name - which should be "ignored".

Match: Janez Novak = Janez Novak OR Novak Janez OR Janez Peter Novak

Not a match: Janez Novak <> Janeza Novak OR Jjanez Novak

Wanted results:

ID   FirstName  LastName  ID   WholeName
---- ---------- --------- ---- -------------------
1    Janez      Novak     2    Novak Peter Janez
3    Francka    Joras     6    Francka Joras

This is my query:

SELECT 
    A.ID
    ,A.FirstName
    ,A.LastName
    ,B.ID
    ,B.WholeName
FROM    
    dbo.UserList B
    cross join dbo.Person A 
WHERE   
    (                                                
    CHARINDEX('"'+A.FirstName+'"', '"'+Replace(B.WholeName,' ','"')+'"') > 0
     AND CHARINDEX('"'+A.LastName+'"', '"'+Replace(B.WholeName,' ','"')+'"') > 0 
    )

The query works OK when there are not many records in the tables.

But my tables have: "Person" -> 400k and "UserList" -> 14k records.

Is my approach to finding a solution OK, or is there any other more efficient way to do that? Thank you.

BR

Tadej
  • 3
  • 3
  • 1
    Which dbms are you using? (This CHARINDEX function is unknown to me.) – jarlh May 23 '22 at 12:00
  • I use MS SQL Server 2017 – Tadej May 23 '22 at 12:31
  • welcome to stackoverflow tadej. have you considered using `LIKE`, something along the line of `UserList.FullName LIKE person.FirstName + '%' AND UserList.FullName LIKE '%' + person.LastName`? – Bagus Tesa May 23 '22 at 13:02
  • LIKE is not OK, because I need the exact name and last name in both tables. For example John in one table and Johnny in another table must not be a match. – Tadej May 23 '22 at 13:11
  • If you can redesign that Users table to include the ID of the Person table - you'll save yourself a world of hurt when those tables grow over time. – JonTout May 23 '22 at 13:32
  • I can not do that. Table "Person" is a part of our ERP system. Each month I get an excel file of thousands of persons with FullName and I have to find if any of our persons appear in this excel list. – Tadej May 23 '22 at 14:18
  • This kind of reminds me of the very first question I asked on SO. Maybe [the answer](https://stackoverflow.com/a/5501861/685760) @MartinSmith provided to my question could be a starting point for you. Not sure how efficient it would be, but it would help identify various individual names from both tables for you. – Mr Moose May 23 '22 at 15:12

3 Answers3

0

Your schema is broken :p

There are various heuristis for doing the matching, but I expect you'll be able to find counterexamples to break whatever you try. For example what about the four people: Peter Smith, Pete Smith, Peter Smithson, and Pete Smithson?

Here's a %LIKE% approach, which I'd expect to be slow.

SELECT p.ID, p.FirstName, p.LastName, u.ID, u.FullName,
    CASE WHEN COUNT(*) OVER (PARTITION BY p.ID) > 1 THEN 0 ELSE 1 END AS MatchIsUnique
FROM Person p
    INNER JOIN UserList u
        ON u.FullName LIKE p.FirstName + '%'
        AND u.LastName LIKE '%' + p.LastName

Here's a string manipulation approach based on the assumption that the space character is the delimiter.

SELECT p.ID, p.FirstName, p.LastName, u.ID, u.FullName,
    CASE WHEN COUNT(*) OVER (PARTITION BY p.ID) > 1 THEN 0 ELSE 1 END AS MatchIsUnique
FROM Person p
    INNER JOIN UserList u
        ON p.FirstName = SUBSTRING(@FullName, 0, CHARINDEX(' ', @Fullname))
        AND p.LastName = SUBSTRING(@FullName, LEN(@FullName) - CHARINDEX(' ', REVERSE(@Fullname))+1, CHARINDEX(' ', REVERSE(@Fullname)))

Probably also quite slow. Maybe you could speed it up by adding

  • SUBSTRING(@FullName, 0, CHARINDEX(' ', @Fullname)) and
  • SUBSTRING(@FullName, LEN(@FullName) - CHARINDEX(' ', REVERSE(@Fullname))+1, CHARINDEX(' ', REVERSE(@Fullname)))

as computed columns and indexing them.

Richard Barraclough
  • 2,625
  • 3
  • 36
  • 54
  • The first query returns only one match ("3-Francka-Joras", "1-Janez-Novak" is missing), and the second query returns me zero matching records. I've made two temp tables with only those attributes that I really need (the physical tables have much more attributes than I describe in my first post). After executing my query on temp tables it took around 20minutes, which is now acceptable. Any idea to speed up the execution is anyway more than welcome. :) – Tadej May 23 '22 at 13:35
  • Looks like `+1` should be `+2` as I'm getting an extra space at the start of the last name. – Richard Barraclough May 24 '22 at 09:14
0

Create tables

create table persons (
  id int IDENTITY(1,1) PRIMARY KEY,
  FirstName nvarchar(32) NOT NULL,
  LastName nvarchar(32) NOT NULL
);

create table users (
  id int IDENTITY(1,1) PRIMARY KEY,
  FullName nvarchar(32) NOT NULL
);

Sample data

INSERT INTO persons (FirstName, LastName)
values
('Janez','Novak'),
('Matija','Špacapan'),
('Francka','Joras');

INSERT INTO users (FullName)
VALUES
('Andrej Novak'),
('Novak Peter Janez'),
('Jana Novak'),
('Andrej Kosir'),
('Jan Balon'),
('Francka Joras'),
('France Joras'),

/* --EDIT: added sample data for wildcard testing-- */
('Franckas Joras'), -- added 's' after firstname
('Francka AJoras'), -- added 'A' before lastname
('Franckas AJoras'), -- both above
('Francka Jr. Joras'), -- added just midname
('Franckas Jr. Joras'); -- added 's' before firstname & added midname as well

Query (matching names)

SELECT p.id, p.FirstName, p.LastName, u.id as user_id, u.FullName
FROM persons p, users u
WHERE
  -- EDIT
  /* changed wildcards (added spaces on both sides)
  + added 2 more conditions without wildcards */
  u.FullName LIKE CONCAT(p.FirstName, ' % ', p.LastName)
  OR
  u.FullName LIKE CONCAT(p.LastName, ' % ', p.FirstName)
  OR
  u.FullName LIKE CONCAT(p.FirstName, ' ', p.LastName)
  OR
  u.FullName LIKE CONCAT(p.LastName, ' ', p.FirstName)

Output

SO-72348127

EDIT: output with new sample data (for wildcard testing) SO-72348127 (2)

Running example SQL Fiddle

Above example link is of MySQL & the code is working fine on SQL server

TalESid
  • 2,304
  • 1
  • 20
  • 41
  • LIKE is not OK, because I need the exact name and last name in both tables. For example, John in one table and Johnny in another table must not be a match. – Tadej May 23 '22 at 14:25
  • `LIKE` will work, but you just have to manipulate wildcards & add some new conditions. I've edited my answer with additional sample data & related query & result screenshot. – TalESid May 24 '22 at 07:13
  • 1
    You are correct. I got the same results and it is very fast. Thank you. – Tadej May 24 '22 at 08:02
  • glad I could be of your help. +1 the answer which worked for you please – TalESid May 24 '22 at 10:52
0

One method you could try is to split the full names into rows and then compare, selecting only those where both first and last name match:

select Max(m.id) Id, max(m.firstname) FirstName, Max(m.lastname) LastName, 
  u.id, Max(u.fullname) FullName
from userlist u
cross apply String_Split(fullname,' ')
cross apply (
    select *
    from person p
    where p.firstname = value or p.lastname = value
)m
group by u.id 
having Count(*)=2;

Output:

enter image description here

Stu
  • 30,392
  • 6
  • 14
  • 33