1

I have a MySQL database structure which has questions and tags, something like stackoverflow.

So my three tables are like this :

questions:
-----------------------
id | content | asked_on

tags:
--------
id | tag

question tags:
---------------
id | qid | tid

and I'm looking for a "pure-SQL" way to get related questions based on a given row from questions table.

By related questions, I mean I want to find the number of mutual tags.

And of course I can construct the SQL query in PHP, so I can do something like :

COUNT xyz ..... WHERE tags.id = 17 OR tags.id = 42 OR etc..

but in the end, I should have something like :

qid  number_of_mutual_tags
---  --------------------
42     4
12     3
25     3
...

Sample Case:

suppose the reference question has tags php,sql,mysql,javascript,html

the table above is constructed with these questions :

id   tags  // of course this is just a representation of the three MySQL tables
---  ----
42   php,sql,mysql,javascript,xyz
12   php,sql,mysql , abc
25   sql,mysql,javascript, ijk

of course it should have a LIMIT clause.

What Have I Tried?

Frankly, I cannot do anything since I'm not so good with SQL. I guess I have to do something with JOIN and COUNT commands but how ?

Thanks for any help !

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
jeff
  • 13,055
  • 29
  • 78
  • 136
  • So what have you tried? – Strawberry Feb 09 '14 at 22:51
  • @Strawberry unfortunately I couldn't try anything except searching on Google. I cannot figure out where to start.. – jeff Feb 09 '14 at 22:53
  • What do you mean by "number of mutual tags"? The number of times a tag is assigned to a question? The total number of tags assigned to a question? The number of unique tags assigned to a question? – scrowler Feb 09 '14 at 22:54
  • @scrowler I mean the number of tags that are assigned both to row A and row B. let me edit the question. – jeff Feb 09 '14 at 22:55

2 Answers2

4

Use a join to find the tags in common. Then aggregate to count them:

select qt.qid, count(qtx.tid) as number_of_mutual_tags
from questiontags qt left outer join
     questiontags qtx
     on qt.tid = qtx.tid and
        qtx.qid = XX 
group by qt.qid
order by number_of_mutual_tags desc;

If you want to exclude the original question, you can add the condition: qt.qid <> qtx.qid.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Wow ! I think this works. Let me test on the real database. And I'm amazed that this only requires the question id as the parameter (i.e., not the tags or tag id's). Either case: Thank you very much ! – jeff Feb 09 '14 at 23:02
  • left outer join is very bad from performance point of view, you can use inner, will work the same way – Iłya Bursov Feb 09 '14 at 23:08
  • @Lashane . . . Please provide some reference about the poor performance of `left outer join`, because my experience says otherwise. Second, a `left outer join` is needed to get the questions that have no matching tags. (Use an inner join if you only want questions that have at least one matching tag.) – Gordon Linoff Feb 09 '14 at 23:12
  • Oh, yes, I want only `number_of_mutual_tags > 0` I'm still trying to test the code, I have a bug in my PHP code now :) I will accept as soon as I see that it really works for me. Thanks again ! – jeff Feb 09 '14 at 23:15
  • @GordonLinoff http://stackoverflow.com/questions/2726657/inner-join-vs-left-join-performance-in-sql-server – Iłya Bursov Feb 09 '14 at 23:15
  • @Lashane . . . That question is about SQL Server. The question itself is on why a `left outer join` would be *faster* (as experienced by the OP). The selected answer is misleading. And if you read the comments, you'll see that even the author of the misleading answer agrees that the performance difference should be minor. This answer has been cited before and it is a shame because the answer is highly misleading. – Gordon Linoff Feb 09 '14 at 23:20
  • @GordonLinoff it does not matter mysql, sql server or any other db, inner join usually faster at least because of result size (less network IO), anyway you can test your query with both joins (just be sure to create big table for tests) – Iłya Bursov Feb 09 '14 at 23:27
  • Either way, in my case performance is not a big problem. But outer join caused a strange thing, some arrays (constructed from this query's result) were empty. Maybe qid was containing NULL in somewhere. But it works in phpMyAdmin.. I still couldn't figure it out. but inner join worked good. Thanks for all the answers ! – jeff Feb 09 '14 at 23:28
1

So, in your code you have some kind of array with tags of currently displayed question, let it be $array

so, you query to get related questions, ordered by count of mutual tags will be:

$query = 'select qid, count(*)
from questiontags
where tid in (' . implode(',', $array) . ')
group by qid
order by count(*) desc';

if you don't have this array, you can change query to

$query = 'select qid, count(*)
from questiontags
where tid in (select tid from questiontags where qid=' . $CURRENTQUESTIONID . ')
group by qid
order by count(*) desc';

and good point from @Gordon Linoff - add condition qid <> $CURRENTQUESTIONID

Iłya Bursov
  • 23,342
  • 4
  • 33
  • 57