Can I use string_split with enforcing combination of labels?

Question

So I have the following table:

Id    Name            Label
---------------------------------------
1     FirstTicket     bike|motorbike
2     SecondTicket    bike
3     ThirdTicket     e-bike|motorbike
4     FourthTicket    car|truck

I want to use string_split function to identify rows that have both bike and motorbike labels. So the desired output in my example will be just the first row:

Id    Name            Label
--------------------------------------
1     FirstTicket     bike|motorbike

Currently, I am using the following query but it is returning row 1,2 and 3. I only want the first. Is it possible?

SELECT Id, Name, Label FROM tickets
WHERE EXISTS (
        SELECT * FROM STRING_SPLIT(Label, '|')
        WHERE value IN ('bike', 'motorbike')
      )

Why are you storing data in a delimited format in the first place? — Thom A, Feb 12 '20 at 17:05
↑↑↑↑↑↑↑ Delimited data violates 1NF and causes nothing but anguish. — Sean Lange, Feb 12 '20 at 17:12
It is not my database unfortunately, I must work with it as it is. In this case I guess it will be better to use a separate table for storing labels, right? — Andrew O, Feb 12 '20 at 17:16
Yes moving those labels to another table would be a good approach. They should be rows instead of delimited values. — Sean Lange, Feb 12 '20 at 17:24

GMB · Answer 1 · 2020-02-12T17:10:34.000

1

You could just use string functions for this:

select t.*
from mytable t
where 
    '|' + label + '|' like '%|bike|%'
    and '|' + label + '|' like '%|motorbike|%'

I would expect this to be more efficient than other methods that split and aggregate.

Please note, however, that you should really consider fixing your data model. Instead of storing delimited lists, you should have a separated table to represent the relation between tickets and labels, with one row per ticket/label tuple. Storing delimited lists in database column is a well-know SQL antipattern, that should be avoided at all cost (hard to maintain, hard to query, hard to enforce data integrity, inefficicent, ...). You can have a look at this famous SO post for more on this topic.

edited Feb 12 '20 at 17:10

answered Feb 12 '20 at 17:04

GMB

216,147
25
84
135

Thanks for your response. I previously used **LIKE** operator but the query is turning really ugly when I need to match more than 5 labels. That's why I was curios about string_split. – Andrew O Feb 12 '20 at 17:26
Yep - this will be slow and ugly. Concatenation (especially in a filter clause such as WHERE, OR or HAVING) will absolutely destroy performance. Making things worse - it does so in a sinister way because it takes a deep, deep dive into the execution plan to understand that Concatenation is the killer. Jeff Moden talks about the issues with Concatenation in this article: https://www.sqlservercentral.com/articles/tally-oh-an-improved-sql-8k-%E2%80%9Ccsv-splitter%E2%80%9D-function – Alan Burstein Feb 12 '20 at 18:14

score 1 · Accepted Answer · answered Feb 12 '20 at 17:05

1

You can use APPLY & do aggregation :

SELECT t.id, t.FirstTicket, t.Label
FROM tickets t CROSS APPLY
     STRING_SPLIT(t.Label, '|') t1
WHERE t1.value IN ('bike', 'motorbike')
GROUP BY t.id, t.FirstTicket, t.Label
HAVING COUNT(DISTINCT t1.value) = 2;

However, this breaks the normalization rules you should have separate table tickets.

answered Feb 12 '20 at 17:05

Yogesh Sharma

49,870
5
26
52

Thank you! Your query does the job – Andrew O Feb 12 '20 at 17:38
@AndrewO Note that I added an updated answer which shows how to do this more efficiently. Not trying to steal Yogesh' thunder :) – Alan Burstein Feb 12 '20 at 18:10

score 1 · Answer 3 · answered Feb 12 '20 at 18:07

Yogesh beat me to it; my solution is similar but with a HUGE performance improvement worth pointing out. We'll start with this sample data:

SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#tickets','U') IS NOT NULL DROP TABLE #tickets;
CREATE TABLE #tickets (Id INT, [Name] VARCHAR(50), Label VARCHAR(1000));
INSERT #tickets (Id, [Name], Label)
VALUES
(1,'FirstTicket' , 'bike|motorbike'),
(2,'SecondTicket', 'bike'),
(3,'ThirdTicket' , 'e-bike|motorbike'),
(4,'FourthTicket', 'car|truck'),
(5,'FifthTicket',  'motorbike|bike');

Now the original and much improved version:

-- Original
SELECT      t.id, t.[Name], t.Label
FROM        #tickets AS t 
CROSS APPLY STRING_SPLIT(t.Label, '|') t1
WHERE       t1.[value] IN ('bike', 'motorbike')
GROUP BY    t.id, t.[Name], t.Label
HAVING      COUNT(DISTINCT t1.[value]) = 2;

-- Improved Version Leveraging APPLY to avoid a sort
SELECT      t.Id, t.[Name], t.Label
FROM        #tickets AS t
CROSS APPLY
(
  SELECT 1
  FROM   STRING_SPLIT(t.Label,'|') AS split
  WHERE  split.[value] IN ('bike','motorbike')
  HAVING COUNT(*) = 2
) AS isMatch(TF);

Now the execution plans:

If you compare the costs: the "sortless" version is query 4.36 times faster than the original. In reality it's more because, with the first version, we're not just sorting, we are sorting three columns - an int and two (n)varchars. Because sorting costs are N * LOG(N), the original query gets exponentially slower the more rows you throw at it.

Thank you Alan, I really appreciate your in depth analysis. I vote up your answer but I only have 13 rep. — Andrew O, Feb 14 '20 at 10:15

Can I use string_split with enforcing combination of labels?

3 Answers3