MySQL / PHP: Find similar / related items by tag / taxonomy

Question

I have a cities table which looks like this.

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table which looks like this.

|id| tag            |
|1 | Europe         |
|2 | North America  |   
|3 | River          |

and a cities_tags table:

|id| city_id | tag_id |
|1 | 1       | 1      | 
|2 | 1       | 3      | 
|3 | 2       | 1      |
|4 | 2       | 3      | 
|5 | 3       | 2      |     
|6 | 3       | 3      |

How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3)

I have found the Jaccard index but I'm unsure as how best to implement this.

why not start with something simple first like total no. of tags on which the cities match and then find the closest cities based on no. of matching tags ? — Maximus2012, Aug 02 '13 at 15:11
Is this all the data you have ? Are you allowed to add more columns to the database, like latitude/longitude values ? Or would you prefer a server side / client side API call every time you want to know this ? — Sliq, Aug 07 '13 at 20:08
What have the tags to do with the distance calculation ? And why is there a "river" tag ? — Sliq, Aug 07 '13 at 20:10
@Panique: The name of the tag should not matter. Could've been AAA, BBB & CCC fot this example — Martijn, Aug 07 '13 at 20:30
@Panique: I have no idea why there is a River tag, it's an example of UGC. Yes I can add additional fields to the DB if required. — Tom, Aug 07 '13 at 21:46
@AaronMiller that's a good question. Tags in common could be one way of doing it, I'm not sure which approach will yield the best results. — Tom, Aug 07 '13 at 21:50
Why does the cities_tags table have 4 entries with the same id? Isn't id supposed to be a unique identifier here? is it a typo? — nl-x, Aug 07 '13 at 22:42
@nl-x sorry, my bad. I just replicated the data here to simplify the example. Fixed now — Tom, Aug 07 '13 at 23:50
@Tom See my updated http://sqlfiddle.com/#!2/e7456/1 Jaccard similarity fiddle — M Khalid Junaid, Aug 11 '13 at 09:59

M Khalid Junaid · Accepted Answer · 2013-08-11T09:57:42.917

You question about How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3) and based on your provided data set there is only one thing to relate that is the common tags between the cities so the cities which shares the common tags would be the closest one below is the subquery which finds the cities (other than which is provided to find its closest cities) that shares the common tags

SELECT * FROM `cities`  WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )

Working

I assume you will input one of the city id or name to find their closest one in my case "Paris" has the id one

 SELECT tag_id FROM `cities_tags` WHERE city_id=1

It will find all the tags id which paris has then

SELECT city_id FROM `cities_tags` WHERE tag_id IN (
    SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )

It will fetch all the cities except paris that has the some same tags that paris also has

Here is your Fiddle

While reading about the Jaccard similarity/index found some stuff to understand about the what actualy the terms is lets take this example we have two sets A & B

Set A={A, B, C, D, E}

Set B={I, H, G, F, E, D}

Formula to calculate the jaccard similarity is JS=(A intersect B)/(A union B)

A intersect B = {D,E}= 2

A union B ={A, B, C, D, E,I, H, G, F} =9

JS=2/9 =0.2222222222222222

Now move towards your scenario

Paris has the tag_ids 1,3 so we make the set of this and call our Set P ={Europe,River}

London has the tag_ids 1,3 so we make the set of this and call our Set L ={Europe,River}

New York has the tag_ids 2,3 so we make the set of this and call our Set NW ={North America,River}

Calculting the JS Paris with London JSPL = P intersect L / P union L , JSPL = 2/2 = 1

Calculting the JS Paris with New York JSPNW = P intersect NW / P union NW ,JSPNW = 1/3 = 0.3333333333

Here is the query so far which calcluates the perfect jaccard index you can see the below fiddle example

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC

In above query i have the i have derived the result set to two subselects in order get my custom calculated aliases

enter image description here

You can add the filter in above query not to calculate the similarity with itself

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE  cities.`id` !=1
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC

So the result shows Paris is closely related to London and then related to New York

Jaccard Similarity Fiddle

Interesting implementation, I'd be curious to see how this scales against my implementation with a large dataset. — Travis Hegner, Aug 12 '13 at 14:45
@TheGunner Some caching should be useful, given that your tags don't change often. — Dzhuneyt, Aug 12 '13 at 15:17
@TheGunner If you notice the scenario and map on real life , its based on cities and countries so this dataset won't be huge to compare and no reason of curious against the performance there are many solutions available for the optimization in RDBMS and OP should have to care about the caching, proper indexing, relations and datatypes for the columns — M Khalid Junaid, Aug 12 '13 at 17:52
I've just tried this on my real dataset of 1,277 cities and 1,809 tags. Works like a dream. Thanks kindly. — Tom, Aug 13 '13 at 13:08
One question. If I wanted to limit the output to the top 10, where should I insert this into the query? — Tom, Aug 13 '13 at 13:09
You have to add the `LIMIT` in the top most parent of all derived tables like in the end of the query `LIMIT 10` — M Khalid Junaid, Aug 13 '13 at 13:21
How exactly do `q.sets + q.parisset` and `q.sets - q.parisset` work? Aren't they casting comma-separated strings to integers and returning their sum? — Stas Bichenko, Jan 05 '14 at 00:00
So I'm pretty sure this solution can't possibly work. I made a Fiddle with different data: http://sqlfiddle.com/#!2/ad2a9/1. I've changed `tag_id`s by multiplying them by changing `1` to `2`, `3` to `8`, `2` to `5`. The result is different, even though it should be the same (seeing as city-tag relationship remained unchanged). This is because during `q.sets - q.parisset` and `q.sets + `q.parisset` `sets` and `parissets` are cast to integers (so only the part before first comma is kept: `2, 2` and `5, 8`). The fact that the original Fiddle works is a coincidence. This is NOT a working answer. — Stas Bichenko, Jan 05 '14 at 15:27
This solution doesn´t work! If you change **New York** only to tag **2**, the result of jaccard index must be 0 similarity, but is returning 0.3333, see [SQL Fiddle](http://sqlfiddle.com/#!9/00251/1/0) — Charles Cavalcante, Apr 24 '16 at 12:05
how `q.sets - q.parisset` equals to `intersect` ? in your result, #1 and #2, intersect of those must be 2 not 0. I think this answer is wrong — Pars, Jan 17 '20 at 08:54

Travis Hegner · Answer 2 · 2013-08-08T13:13:51.430

select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c 
inner join 
  (
  select city_id, count(*) as val 
  from cities_tags 
  where tag_id in (select tag_id from cities_tags where city_id=1) 
  and not city_id in (1)
  group by city_id
  ) as cnt 
on c.id=cnt.city_id
order by jaccard_index desc

This query is statically referring to city_id=1, so you'll have to make that a variable in both the where tag_id in clause, and the not city_id in clause.

If I understood the Jaccard index properly, then it also returns that value ordered by the 'most closely related'. The results in our example look like this:

|name      |jaccard_index  |
|London    |0.6667         |
|New York  |0.3333         |

Edit

With a better understanding of how to implement Jaccard Index:

After reading a bit more on wikipedia about the Jaccard Index, I've come up with a better way implement a query for our example dataset. Essentially, we will be comparing our chosen city with each other city in the list independently, and using the count of common tags divided by the count of distinct total tags selected between the two cities.

select c.name, 
  case -- when this city's tags are a subset of the chosen city's tags
    when not_in.cnt is null 
  then -- then the union count is the chosen city's tag count
    intersection.cnt/(select count(tag_id) from cities_tags where city_id=1) 
  else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
    intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1)) 
  end as jaccard_index
  -- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c 
inner join 
  (
    --  select the count of tags for each city that match our chosen city
    select city_id, count(*) as cnt 
    from cities_tags 
    where tag_id in (select tag_id from cities_tags where city_id=1) 
    and city_id!=1
    group by city_id
  ) as intersection
on c.id=intersection.city_id
left join
  (
    -- select the count of tags for each city that are not in our chosen city's tag list
    select city_id, count(tag_id) as cnt
    from cities_tags
    where city_id!=1
    and not tag_id in (select tag_id from cities_tags where city_id=1)
    group by city_id
  ) as not_in
on c.id=not_in.city_id
order by jaccard_index desc

The query is a bit lengthy, and I don't know how well it will scale, but it does implement a true Jaccard Index, as requested in the question. Here are the results with the new query:

+----------+---------------+
| name     | jaccard_index |
+----------+---------------+
| London   |        1.0000 |
| New York |        0.3333 |
+----------+---------------+

Edited again to add comments to the query, and take into account when the current city's tags are a subset of the chosen city's tags

My Jaccard Index is not correct. I will edit sometime today with a correct implementation. — Travis Hegner, Aug 07 '13 at 22:19
Please see the new query which implements a true Jaccard Index. — Travis Hegner, Aug 08 '13 at 01:49

score 4 · Answer 3 · edited Jun 20 '20 at 09:12

Too late, but I think that none of answers are fully correct. I got the best part of each one and put all together to make my own answer:

The Jaccard Index explanaiton of @m-khalid-junaid is very interesting and correct, but the implementation of (q.sets + q.parisset) AS union and (q.sets - q.parisset) AS intersect is very wrong.
The version of @n-lx is the way, but needs the Jaccard Index, this is very important, if a city have 2 tags and matches two tags of another city with 3 tags, the result will be the same of the matches on another city with only the same two tags. I think the full matches is most related.

My answer:

cities table like this.

| id | Name      |
| 1  | Paris     |
| 2  | Florence  |
| 3  | New York  |
| 4  | São Paulo |
| 5  | London    |

cities_tag table like this.

| city_id | tag_id |
| 1       | 1      | 
| 1       | 3      | 
| 2       | 1      |
| 2       | 3      | 
| 3       | 1      |     
| 3       | 2      |
| 4       | 2      |     
| 5       | 1      |
| 5       | 2      |
| 5       | 3      |

With this sample data, Florence have a full matches with Paris, New York matches one tag, São Paulo have no tags matches and London matches two tags and have another one. I think the Jaccard Index of this sample is:

Florence: 1.000 (2/2)

London: 0.666 (2/3)

New York: 0.333 (1/3)

São Paulo: 0.000 (0/3)

My query is like this:

select jaccard.city, 
       jaccard.intersect, 
       jaccard.union, 
       jaccard.intersect/jaccard.union as 'jaccard index'
from 
(select
    c2.name as city
    ,count(ct2.tag_id) as 'intersect' 
    ,(select count(distinct ct3.tag_id) 
      from cities_tags ct3 
      where ct3.city_id in(c1.id, c2.id)) as 'union'
from
    cities as c1
    inner join cities as c2 on c1.id != c2.id
    left join cities_tags as ct1 on ct1.city_id = c1.id
    left join cities_tags as ct2 on ct2.city_id = c2.id and ct1.tag_id = ct2.tag_id
where c1.id = 1
group by c1.id, c2.id) as jaccard
order by jaccard.intersect/jaccard.union desc

SQL Fidde

nl-x · Answer 4 · 2013-08-07T23:33:02.780

This query is without any fancy functions or even sub queries. It is fast. Just make sure cities.id, cities_tags.id, cities_tags.city_id and cities_tags.tag_id have an index.

The queries returns a result containing: city1, city2 and the count of how many tags city1 and city2 have in common.

select
    c1.name as city1
    ,c2.name as city2
    ,count(ct2.tag_id) as match_count
from
    cities as c1
    inner join cities as c2 on
        c1.id != c2.id              -- change != into > if you dont want duplicates
    left join cities_tags as ct1 on -- use inner join to filter cities with no match
        ct1.city_id = c1.id
    left join cities_tags as ct2 on -- use inner join to filter cities with no match
        ct2.city_id = c2.id
        and ct1.tag_id = ct2.tag_id
group by
    c1.id
    ,c2.id
order by
    c1.id
    ,match_count desc
    ,c2.id

Change != into > to avoid each city to be returned twice. Meaning a city will then no longer appears once in the first column as well as once in the second column.

Change the two left join into inner join if you don't want to see the city combinations that have no tag matches.

It will return duplicates and also the with the city name to be matched — M Khalid Junaid, Aug 07 '13 at 22:11
@dianuj I have added a comment in the query to address the duplicates issue. (Change `!=` into `>`). And you are mistaking: the city names don't match. — nl-x, Aug 07 '13 at 23:01

score 1 · Answer 5 · answered Aug 07 '13 at 20:25

Could this be a push in the right direction?

SELECT cities.name, ( 
                    SELECT cities.id FROM cities
                    JOIN cities_tags ON cities.id=cities_tags.city_id
                    WHERE tags.id IN(
                                     SELECT cities_tags.tag_id
                                     FROM cites_tags
                                     WHERE cities_tags.city_id=cites.id
                                     )
                    GROUP BY cities.id
                    HAVING count(*) > 0
                    ) as matchCount 
FROM cities
HAVING matchCount >0

What I tried was this:

// Find the citynames:
Get city.names (SUBQUERY) as matchCount FROM cities WHERE matchCount >0

// the subquery:
select the amount of tags cities have which (SUBSUBQUERY) also has

// the subsubquery
select the id of the tags the original name has

MySQL / PHP: Find similar / related items by tag / taxonomy

5 Answers5

Working

Edit

My answer:

SQL Fidde

Linked