sql query for non-existing entries

Question

I have inherited a website and its corresponding database (SQL Server). The website uses stored procedures to pull data from the database. One of these stored procedures contains a pivot and it the pivot is taking over 4 hours to run. This is currently unacceptable. I am looking for help in replacing the pivot with standard SQL queries because I assume that will be faster and provide better performance.

Here is the pivot in question:

SELECT *
FROM (
    SELECT ac.AID
        ,ac.CatName AS t
        ,convert(INT, ac.Code) AS c
    FROM categories AS ac
    ) AS s
Pivot(Sum(c) FOR t IN (
            [tob]
            ,[ecit]
            ,[tobwcom]
            ,[rnorm]
            ,[raddict]
            ,[rpolicy]
            ,[ryouth]
            ,[rhealth]
            ,…
            )) AS p;

And the results of the pivot

|    AID    |    tob    |    ecit    |    tobwcom    |    rnorm    |
|-----------|-----------|------------|---------------|-------------|
|    1      |    1      |    NULL    |    NULL       |    0        |
|    2      |    1      |    NULL    |    NULL       |    1        |
|    3      |    1      |    NULL    |    NULL       |    0        |
|    4      |    1      |    NULL    |    NULL       |    0        |
|    5      |    1      |    NULL    |    NULL       |    0        |
|    6      |    1      |    NULL    |    NULL       |    1        |

Here’s the source table categories and some sample data:

CREATE TABLE categories(
   ArticleID   INTEGER  NOT NULL  
  ,ThemeID     INTEGER  NOT NULL
  ,ThemeName   VARCHAR(7) NOT NULL
  ,Code        BIT  NOT NULL
  ,CreatedTime VARCHAR(7) NOT NULL
);
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (1,1,'tob',1,'57:30.7');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (1,2,'ecig',1,'03:58.3');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (1,5,'rnorm',0,'42:56.5');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (2,1,'tob',1,'57:30.7');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (2,2,'ecig',0,'03:58.3');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (2,5,'rnorm',1,'42:56.5');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (2,6,'raddict',0,'42:59.8');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (3,1,'tob',1,'57:30.7');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (3,2,'ecig',0,'03:58.3');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (3,5,'rnorm',0,'42:56.5');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (21,1,'tob',1,'57:30.7');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (21,2,'ecig',0,'03:58.3');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (21,5,'rnorm',0,'42:56.5');
INSERT INTO categories(ArticleID,ThemeID,ThemeName,Code,CreatedTime) VALUES (21,6,'raddict',0,'42:59.8');

And here’s the table containing the category names – (mytable for now)

CREATE TABLE mytable(
   CatID       INTEGER  NOT NULL PRIMARY KEY 
  ,CatName     VARCHAR(7) NOT NULL
  ,CreatedTime DATETIME  NOT NULL
);
INSERT INTO mytable(CatID,CatName,CreatedTime) VALUES (1,'tob','2015-03-12 10:07:54.173');
INSERT INTO mytable(CatID,CatName,CreatedTime) VALUES (2,'ecig','2015-05-18 11:48:16.297');
INSERT INTO mytable(CatID,CatName,CreatedTime) VALUES (4,'tobwcom','2015-06-19 11:12:01.537');
INSERT INTO mytable(CatID,CatName,CreatedTime) VALUES (5,'rnorm','2015-06-22 14:24:02.317');
INSERT INTO mytable(CatID,CatName,CreatedTime) VALUES (6,'raddict','2015-06-22 14:24:13.957');
INSERT INTO mytable(CatID,CatName,CreatedTime) VALUES (7,'ecit','2015-06-22 14:26:18.437');

What I need is a way to perform the pivot’s ability to find the non-existing data in categories. The output would be something like:

|    AID    |    tob    |    ecit    |    tobwcom    |    rnorm    |
|-----------|-----------|------------|---------------|-------------|
|    1      |    1      |    NULL    |    NULL       |    0        |
|    2      |    1      |    NULL    |    NULL       |    1        |

Or the list of AIDs and the CatNames that don’t have any values. Such as:

| AID | CatName |
|-----|---------|
| 1   | ecit    |
| 1   | tobwcom |
| 2   | ecit    |
| 2   | tobwcom |

I have tried

select distinct(AID) FROM [categories]
 where [CatName] not in ( 'ecit', 'tobwcom')

but the results from this, the numbers don't seem to add up, however this could be an error on my part.

What is the actual question here? And it would help if you provided some details. See [this article](http://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum/). — Sean Lange, Dec 05 '18 at 19:17
@SeanLange my apologies. I have this pivot, and it takes over 4 hours to run (just for the pivot). The only reason for the pivot is to find the cases where a category hasn't been assigned to an A.I.D. (the null cases in the pivot result table). for example I'm wanting to find A.I.D.s that haven't been assigned a value for `ecit` or `tobwcom` which would give me A.I.D.s 1,2,3,4,5,6... The source table `categories` doesn't actually have any entries for these cases. so how would I find entries that don't exist? - I'll also update the question with this. — shadonar, Dec 05 '18 at 19:35
Well your update makes this even less clear. At first I thought this was a performance question. Now it seems like something else. What would really help is a clear explanation of what you are trying to accomplish. Posting the details of the tables would help too (like ddl so we know things like datatypes etc). Finally the desired output based on the sample data. — Sean Lange, Dec 05 '18 at 19:45
Is "categories" really a table? Or a view? Because in a normalized setup, the "AID" & "CatID" would be in 1 table . And the "CatName" in another, with "CatID" as the primary key. If it's a table then are you sure that each "CatID" only has 1 "CatName"? — LukStorms, Dec 05 '18 at 20:13
@LukStorms "categories" is a table. the AID is a foreign key from a different table. and yes each "CatID" only has 1 "CatName". again i inherited this and I can't really change the way the tables are structured. — shadonar, Dec 05 '18 at 20:34
@SeanLange I've re-wrote the question as per the suggested article you linked and attempted to clean it up. Please let me know if this is better — shadonar, Dec 05 '18 at 20:53
Looks like you need a dynamic pivot or a dynamic cross tab. Like this? https://stackoverflow.com/questions/10404348/sql-server-dynamic-pivot-query/10404455#10404455 — Sean Lange, Dec 05 '18 at 21:04
@SeanLange as long as it performs quickly it could work. just to point out, the `categories` table is currently about 1.608 billion rows, the total number of unique `AID` values is ~107 million, and the current number of `CatName`s is 78 (this is subject to change as the project continues). I'll try the dynamic pivot/cross tab and see if that has better performance. Thanks! — shadonar, Dec 05 '18 at 21:22
With 1.6 billion rows this is never going to be super fast. You are denormalizing your data here. — Sean Lange, Dec 05 '18 at 21:24
Here is a great article on doing a dynamic cross tab. http://www.sqlservercentral.com/articles/Crosstab/65048/ — Sean Lange, Dec 05 '18 at 21:24

LukStorms · Accepted Answer · 2018-12-07T11:48:38.947

1

Not sure if it would be fast enough for such a huge table. But for that second expected result then something this could help to find the missing.

select a.ArticleID, c.CatName
from #myarticles a
cross join #mycategories c
left join categories ca on (ca.ArticleID = a.ArticleID and ca.ThemeID = c.CatID)
where ca.ArticleID is null;

A test can be found here

Note that this method benefits from a combined primary key index on (ArticleID, ThemeID)

As an alternative, the LEFT JOIN with a NULL check can be changed to a NOT EXISTS.

select a.ArticleID, c.CatName
from #myarticles a
join #mycategories c on c.CatID between 1 and 7
where NOT EXISTS
(
   select 1
   from categories ca 
   where ca.ArticleID = a.ArticleID 
     and ca.ThemeID = c.CatID
);

edited Dec 07 '18 at 11:48

answered Dec 05 '18 at 22:58

LukStorms

28,916
5
31
45

would it be possible to do this without having the `#myarticles` table? – shadonar Dec 06 '18 at 16:36
If you don't already have an existing table with the unique article ID's then I suppose you could use a sub-query with DISTINCT or GROUP BY to replace #myarticles. The idea behind this query is to generate all possible combinations so you can link to the existing to find the missing. – LukStorms Dec 06 '18 at 17:16

sql query for non-existing entries

1 Answers1