2

For a class project, a few others and I have decided to make a (very ugly) limited clone of StackOverflow. For this purpose, we're working on one query:

Home Page: List all the questions, their scores (calculated from votes), and the user corresponding to their first revision, and the number of answers, sorted in date-descending order according to the last action on the question (where an action is an answer, an edit of an answer, or an edit of the question).

Now, we've gotten the entire thing figured out, except for how to represent tags on questions. We're currently using a M-N mapping of tags to questions like this:

CREATE TABLE QuestionRevisions (
id INT IDENTITY NOT NULL,
question INT NOT NULL,
postDate DATETIME NOT NULL,
contents NTEXT NOT NULL,
creatingUser INT NOT NULL,
title NVARCHAR(200) NOT NULL,
PRIMARY KEY (id),
CONSTRAINT questionrev_fk_users FOREIGN KEY (creatingUser) REFERENCES
Users (id) ON DELETE CASCADE,
CONSTRAINT questionref_fk_questions FOREIGN KEY (question) REFERENCES
Questions (id) ON DELETE CASCADE
);

CREATE TABLE Tags (
id INT IDENTITY NOT NULL,
name NVARCHAR(45) NOT NULL,
PRIMARY KEY (id)
);

CREATE TABLE QuestionTags (
tag INT NOT NULL,
question INT NOT NULL,
PRIMARY KEY (tag, question),
CONSTRAINT qtags_fk_tags FOREIGN KEY (tag) REFERENCES Tags(id) ON
DELETE CASCADE,
CONSTRAINT qtags_fk_q FOREIGN KEY (question) REFERENCES Questions(id) ON
DELETE CASCADE
);

Now, for this query, if we just join to QuestionTags, then we'll get the questions and titles over and over and over again. If we don't, then we have an N query scenario, which is just as bad. Ideally, we'd have something where the result row would be:

+-------------+------------------+
| Other Stuff | Tags             |
+-------------+------------------+
| Blah Blah   | TagA, TagB, TagC |
+-------------+------------------+

Basically -- for each row in the JOIN, do a string join on the resulting tags.

Is there a built in function or similar which can accomplish this in T-SQL?

Bohemian
  • 412,405
  • 93
  • 575
  • 722
Billy ONeal
  • 104,103
  • 58
  • 317
  • 552
  • do you need to resolve this only with tsql? – Sebastian Piu Dec 11 '11 at 20:22
  • @SebastianPiu: It would be really nice, yes. I don't want to have to install modules or anything on the SQL server, if that's what you mean. (If it can't be done I'll just fall over to slower methods... not like I really have to worry about lots of users for a system that won't ever see a production deployment).. – Billy ONeal Dec 11 '11 at 20:24
  • 1
    no, I didn't mean that exactly, I meant that this is probably best resolved at your dal layer (or whatever you have acting as a dal). You'd need 2 queries I think, One to get all the questions with the information, plus another one to get QuestionId, Tag pairs. – Sebastian Piu Dec 11 '11 at 20:26
  • Perhaps this answer is helpful for you? http://stackoverflow.com/a/6603735/569436 It should be if I understand correctly what you need. – Mikael Eriksson Dec 11 '11 at 21:35
  • This "merge rows into a string" type question is asked so many times I sometimes wonder if we need a sql-faq tag or something. – Bert Dec 17 '11 at 06:43
  • @Bert Evans: Maybe that means it's a function RDBMS vendors should add (As an aggregate function). – Billy ONeal Dec 17 '11 at 06:44
  • @BillyONeal Possibly, but, more pertinent to SO in general is the question of why the same question is asked and answered over and over. Why, for example, were you unable to find the answer to the question as it already exists on SO? It seems like a failure on one side or the other to make the already available answer easily found. – Bert Dec 17 '11 at 06:56
  • @BertEvans: Probably because there's not a good name for what's going on here. Everyone calls it something different. Since posting this I have indeed run into similar questions once or twice, but the questions themselves were written extremely differently. This is made even worse because the "string" version of the operation is called a join, which of course conflicts with the SQL definition of JOIN, which means the search box is almost worthless. I don't think there's any real way for the system to find such duplicates; case in point, none of the "similar questions" on the side answer this. – Billy ONeal Dec 17 '11 at 07:03

3 Answers3

2

Here's one possible solution using recursive CTE:

The methods used are explained here

TSQL to set up the test data (I'm using table variables):

DECLARE @QuestionRevisions TABLE  ( 
id INT IDENTITY NOT NULL, 
question INT NOT NULL, 
postDate DATETIME NOT NULL, 
contents NTEXT NOT NULL, 
creatingUser INT NOT NULL, 
title NVARCHAR(200) NOT NULL)

DECLARE @Tags TABLE ( 
id INT IDENTITY NOT NULL, 
name NVARCHAR(45) NOT NULL
)

DECLARE @QuestionTags TABLE ( 
tag INT NOT NULL, 
question INT NOT NULL
)
INSERT INTO @QuestionRevisions 
(question,postDate,contents,creatingUser,title)
VALUES
(1,GETDATE(),'Contents 1',1,'TITLE 1')

INSERT INTO @QuestionRevisions 
(question,postDate,contents,creatingUser,title)
VALUES
(2,GETDATE(),'Contents 2',2,'TITLE 2')

INSERT INTO @Tags (name) VALUES ('Tag 1')
INSERT INTO @Tags (name) VALUES ('Tag 2')
INSERT INTO @Tags (name) VALUES ('Tag 3')
INSERT INTO @Tags (name) VALUES ('Tag 4')
INSERT INTO @Tags (name) VALUES ('Tag 5')
INSERT INTO @Tags (name) VALUES ('Tag 6')

INSERT INTO @QuestionTags (tag,question) VALUES (1,1)
INSERT INTO @QuestionTags (tag,question) VALUES (3,1)
INSERT INTO @QuestionTags (tag,question) VALUES (5,1)
INSERT INTO @QuestionTags (tag,question) VALUES (4,2)
INSERT INTO @QuestionTags (tag,question) VALUES (2,2)

Here's the action part:

;WITH CTE ( id, taglist, tagid, [length] ) 
      AS (  SELECT question, CAST( '' AS VARCHAR(8000) ), 0, 0
            FROM @QuestionRevisions qr
            GROUP BY question
            UNION ALL
            SELECT qr.id
                ,  CAST(taglist + CASE WHEN [length] = 0 THEN '' ELSE ', ' END + t.name AS VARCHAR(8000) )
                ,  t.id
                ,  [length] + 1
            FROM CTE c 
            INNER JOIN @QuestionRevisions qr ON c.id = qr.question
            INNER JOIN @QuestionTags qt ON qr.question=qt.question
            INNER JOIN @Tags t ON t.id=qt.tag
            WHERE t.id > c.tagid )
SELECT id, taglist 
FROM ( SELECT id, taglist, RANK() OVER ( PARTITION BY id ORDER BY length DESC )
         FROM CTE ) D ( id, taglist, rank )
WHERE rank = 1;
Martin Smith
  • 438,706
  • 87
  • 741
  • 845
Mack
  • 2,556
  • 1
  • 26
  • 44
  • This seems to work. I've decided to go with something subtly different, but I didn't specify that in my question and my solution doesn't work when the number of tags is greater than a certain number. I've therefore upvoted and checked this answer; though I'll post mine once I get the damn thing debugged. Thanks! – Billy ONeal Dec 17 '11 at 00:30
1

This was the solution I ended up settling on. I checkmarked Mack's answer because it works with arbitrary numbers of tags, and because it matches what I asked for in my question. I ended up though going with this, however, simply because I understand what this is doing, while I have no idea how Mack's works :)

WITH tagScans (qRevId, tagName, tagRank)
AS (
    SELECT DISTINCT
        QuestionTags.question AS qRevId,
        Tags.name AS tagName,
        ROW_NUMBER() OVER (PARTITION BY QuestionTags.question ORDER BY Tags.name) AS tagRank
    FROM QuestionTags
    INNER JOIN Tags ON Tags.id = QuestionTags.tag
)
SELECT
    Questions.id AS id,
    Questions.currentScore AS currentScore,
    answerCounts.number AS answerCount,
    latestRevUser.id AS latestRevUserId,
    latestRevUser.caseId AS lastRevUserCaseId,
    latestRevUser.currentScore AS lastRevUserScore,
    CreatingUsers.userId AS creationUserId,
    CreatingUsers.caseId AS creationUserCaseId,
    CreatingUsers.userScore AS creationUserScore,
    t1.tagName AS tagOne,
    t2.tagName AS tagTwo,
    t3.tagName AS tagThree,
    t4.tagName AS tagFour,
    t5.tagName AS tagFive
FROM Questions
INNER JOIN QuestionRevisions ON QuestionRevisions.question = Questions.id
INNER JOIN
(
    SELECT
        Questions.id AS questionId,
        MAX(QuestionRevisions.id) AS maxRevisionId
    FROM Questions
    INNER JOIN QuestionRevisions ON QuestionRevisions.question = Questions.id
    GROUP BY Questions.id
) AS LatestQuestionRevisions ON QuestionRevisions.id = LatestQuestionRevisions.maxRevisionId
INNER JOIN Users AS latestRevUser ON latestRevUser.id = QuestionRevisions.creatingUser
INNER JOIN
(
    SELECT
        QuestionRevisions.question AS questionId,
        Users.id AS userId,
        Users.caseId AS caseId,
        Users.currentScore AS userScore
    FROM Users
    INNER JOIN QuestionRevisions ON QuestionRevisions.creatingUser = Users.id
    INNER JOIN
    (
        SELECT
            MIN(QuestionRevisions.id) AS minQuestionRevisionId
        FROM Questions
        INNER JOIN QuestionRevisions ON QuestionRevisions.question = Questions.id
        GROUP BY Questions.id
    ) AS QuestionGroups ON QuestionGroups.minQuestionRevisionId = QuestionRevisions.id
) AS CreatingUsers ON CreatingUsers.questionId = Questions.id
INNER JOIN
(
    SELECT
        COUNT(*) AS number,
        Questions.id AS questionId
    FROM Questions
    INNER JOIN Answers ON Answers.question = Questions.id
    GROUP BY Questions.id
) AS answerCounts ON answerCounts.questionId = Questions.id
LEFT JOIN tagScans AS t1 ON t1.qRevId = QuestionRevisions.id AND t1.tagRank = 1
LEFT JOIN tagScans AS t2 ON t2.qRevId = QuestionRevisions.id AND t2.tagRank = 2
LEFT JOIN tagScans AS t3 ON t3.qRevId = QuestionRevisions.id AND t3.tagRank = 3
LEFT JOIN tagScans AS t4 ON t4.qRevId = QuestionRevisions.id AND t4.tagRank = 4
LEFT JOIN tagScans AS t5 ON t5.qRevId = QuestionRevisions.id AND t5.tagRank = 5
ORDER BY QuestionRevisions.postDate DESC
Billy ONeal
  • 104,103
  • 58
  • 317
  • 552
1

This is a common question that comes up quite often phrased in a number of different ways (concatenate rows as string, merge rows as string, condense rows as string, combine rows as string, etc.). There are two generally accepted ways to handle combining an arbitrary number of rows into a single string in SQL Server.

The first, and usually the easiest, is to abuse XML Path combined with the STUFF function like so:

select rsQuestions.QuestionID,
       stuff((select ', '+ rsTags.TagName
              from @Tags rsTags  
              inner join @QuestionTags rsMap on rsMap.TagID = rsTags.TagID
              where rsMap.QuestionID = rsQuestions.QuestionID
              for xml path(''), type).value('.', 'nvarchar(max)'), 1, 1, '')
from @QuestionRevisions rsQuestions 

Here is a working example (borrowing some slightly modified setup from Mack). For your purposes you could store the results of that query in a common table expression, or in a subquery (I'll leave that as an exercise).

The second method is to use a recursive common table expression. Here is an annotated example of how that would work:

--NumberedTags establishes a ranked list of tags for each question.
--The key here is using row_number() or rank() partitioned by the particular question
;with NumberedTags (QuestionID, TagString, TagNum) as
(
    select  QuestionID,
            cast(TagName as nvarchar(max)) as TagString,
            row_number() over (partition by QuestionID order by rsTags.TagID) as TagNum
    from @QuestionTags rsMap
    inner join @Tags rsTags on rsTags.TagID = rsMap.TagID
),
--TagsAsString is the recursive query
TagsAsString (QuestionID, TagString, TagNum) as
(
    --The first query in the common table expression establishes the anchor for the 
    --recursive query, in this case selecting the first tag for each question
    select  QuestionID,
            TagString,
            TagNum
    from NumberedTags 
    where TagNum = 1
        
    union all
    
    --The second query in the union performs the recursion by joining the 
    --anchor to the next tag, and so on...
    select  NumberedTags.QuestionID,
            TagsAsString.TagString + ', ' + NumberedTags.TagString,
            NumberedTags.TagNum
    from    NumberedTags
    inner join TagsAsString on TagsAsString.QuestionID = NumberedTags.QuestionID
                           and NumberedTags.TagNum = TagsAsString.TagNum + 1
)
--The result of the recursive query is a list of tag strings building up to the final
--string, of which we only want the last, so here we select the longest one which
--gives us the final result
select QuestionID, max(TagString) 
from TagsAsString                         
group by QuestionID

And here is a working version. Again, you could use the results in a common table expression or subquery to join against your other tables to get your ultimate result. Hopefully the annotations help you understand a little more how the recursive common table expression works (though the link in Macks answer also goes into some detail about the method).

There is, of course, another way to do it, which doesn't handle an arbitrary number of rows, which is to join against your table aliased multiple times, which is what you did in your answer.

Community
  • 1
  • 1
Bert
  • 80,741
  • 17
  • 199
  • 164