How to compare if two strings contain the same words in SQL Server 2014?

Question

I'm trying to solve the same problem as in this question, but this time in SQL Server 2014. I need to check if strings are made out of the same words:

Returns true for:

Antoine de Saint-Exupéry = de Saint-Exupéry Antoine = Saint-Exupéry Antoine de = etc.

and

Returns false for:

Antoine de Saint-Exupéry != Antoine de Saint != Antoine Antoine de Saint-Exupéry != etc.

What are my options in SQL Server 2014? Is there a built-in function for such comparison?

No, you have to roll your own. – TheGameiswar, Mar 23 '17 at 14:33
Split the string by space, compare sorted content. – Zohar Peled, Mar 23 '17 at 14:55

score 2 · Answer 1 · answered Mar 23 '17 at 16:39

To compare 2 strings, one could ~~abuse~~ use the sorting capability in XQuery.

Cast the string to XML, sort the elements, and then return the text without the tags.

For example:

DECLARE @Words1 NVARCHAR(MAX) = N'Antoine de Saint-Exupéry';
DECLARE @Words2 NVARCHAR(MAX) = N'Saint-Exupéry Antoine de';

DECLARE @SortedWords1 NVARCHAR(MAX) = cast('<x>'+replace(@Words1,' ','</x><x>')+'</x>' as XML).query('for $x in /x order by $x ascending return $x').value('.','nvarchar(max)');
DECLARE @SortedWords2 NVARCHAR(MAX) = cast('<x>'+replace(@Words2,' ','</x><x>')+'</x>' as XML).query('for $x in /x order by $x ascending return $x').value('.','nvarchar(max)');

DECLARE @SameWords BIT = (case 
                          when @SortedWords1 = @SortedWords2
                          then 1 
                          else 0 
                          end);


SELECT @SameWords as SameWords;

Returns:

SameWords
---------
True
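
One thing to watch: .value('.', ...) glues the sorted words together with nothing between them, so N'ab c' and N'a bc' both collapse to abc, and an input containing & or < fails the cast to XML. A minimal sketch of one way around both, assuming it is acceptable to compare the serialized XML rather than the bare text (the @Escaped and @Xml variable names are mine, not part of the code above):

```sql
-- Sketch only: same XQuery sort as above, but the sorted XML is cast to a
-- string with its <x> tags intact, so word boundaries survive the comparison.
DECLARE @Words1 NVARCHAR(MAX) = N'ab c';
DECLARE @Words2 NVARCHAR(MAX) = N'a bc';

-- Escape XML-special characters before building the XML; & must go first.
DECLARE @Escaped1 NVARCHAR(MAX) = REPLACE(REPLACE(REPLACE(@Words1, N'&', N'&amp;'), N'<', N'&lt;'), N'>', N'&gt;');
DECLARE @Escaped2 NVARCHAR(MAX) = REPLACE(REPLACE(REPLACE(@Words2, N'&', N'&amp;'), N'<', N'&lt;'), N'>', N'&gt;');

-- One <x> element per space-separated word.
DECLARE @Xml1 XML = CAST(N'<x>' + REPLACE(@Escaped1, N' ', N'</x><x>') + N'</x>' AS XML);
DECLARE @Xml2 XML = CAST(N'<x>' + REPLACE(@Escaped2, N' ', N'</x><x>') + N'</x>' AS XML);

-- Sort the elements, then serialize the result including the tags.
DECLARE @SortedWords1 NVARCHAR(MAX) = CAST(@Xml1.query('for $x in /x order by $x ascending return $x') AS NVARCHAR(MAX));
DECLARE @SortedWords2 NVARCHAR(MAX) = CAST(@Xml2.query('for $x in /x order by $x ascending return $x') AS NVARCHAR(MAX));

SELECT CASE WHEN @SortedWords1 = @SortedWords2 THEN 1 ELSE 0 END AS SameWords;
```

The final = comparison still follows the collation in effect, so whether Antoine and antoine count as the same word depends on that collation.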

score 1 · Answer 2 · answered Mar 23 '17 at 14:54

Here is one way you could roll your own for this. I am using the string splitter from Jeff Moden; the original article is at http://www.sqlservercentral.com/articles/Tally+Table/72993/. If you don't like that splitter, there are some other great versions at https://sqlperformance.com/2012/07/t-sql-queries/split-strings. I like the one from Jeff Moden because, unlike the other splitters, it returns the ItemNumber, which in some cases is incredibly useful.

CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE!  IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
     -- enough to cover VARCHAR(8000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT 1 UNION ALL
                 SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                ),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                 SELECT s.N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                   FROM cteStart s
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = SUBSTRING(@pString, l.N1, l.L1)
   FROM cteLen l
;

The basic concept here is that you have to split your strings into words and then compare the two sets of words. I used a couple of CTEs so the process is more obvious. The following handles the reordered and missing-word examples you posted; note that the join matches on the word itself, so a word repeated in one phrase is not counted twice.

declare @Phrase1 nvarchar(100) = 'Antoine de Saint-Exupéry'
    , @Phrase2 nvarchar(100) = 'de Saint-Exupéry Antoine'
;

with Phrase1 as
(
    select * 
    from DelimitedSplit8K(@Phrase1, ' ')
)
, Phrase2 as
(
    select * 
    from DelimitedSplit8K(@Phrase2, ' ')
)

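-- rows survive the WHERE clause only for words missing from the other phrase,
-- so the phrases are equal exactly when no such rows remain
select PhrasesEqual = convert(bit, case when count(*) = 0 then 1 else 0 end)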
from Phrase1 p1
full outer join Phrase2 p2 on p2.Item = p1.Item
where p1.Item is null
    or p2.Item is null
;
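
For the repeated-word case from the question (Antoine Antoine de Saint-Exupéry against Antoine de Saint-Exupéry), one possible variant is to count how often each word occurs and join on that count as well. This is a sketch, assuming the DelimitedSplit8K function above has been created in the dbo schema; the Occurrences column name is my own.

```sql
declare @Phrase1 varchar(100) = 'Antoine Antoine de Saint-Exupéry'
    , @Phrase2 varchar(100) = 'Antoine de Saint-Exupéry'
;

with Phrase1 as
(
    -- one row per distinct word, with the number of times it appears
    select Item, Occurrences = count(*)
    from dbo.DelimitedSplit8K(@Phrase1, ' ')
    group by Item
)
, Phrase2 as
(
    select Item, Occurrences = count(*)
    from dbo.DelimitedSplit8K(@Phrase2, ' ')
    group by Item
)

-- a word that is missing, or present a different number of times, finds no
-- partner in the join; PhrasesEqual is 1 only when no unmatched rows remain
select PhrasesEqual = convert(bit, case when count(*) = 0 then 1 else 0 end)
from Phrase1 p1
full outer join Phrase2 p2 on p2.Item = p1.Item
    and p2.Occurrences = p1.Occurrences
where p1.Item is null
    or p2.Item is null
;
```

Grouping before the join also keeps the join from multiplying rows when a word appears several times in both phrases.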
