2

Let's assume we have a database like this:

Project_tbl:

-----------------
id | Project_name
-----------------
1  | A 
2  | B
3  | C
-----------------

personel_project_tbl:

--------------------
user_id | Project_id
--------------------
1       | 1
2       | 2
3       | 1
3       | 2
2       | 3
--------------------

instrument_project_tbl:

--------------------------
instrument_id | Project_id
--------------------------
1             | 1
1             | 2
2             | 2
2             | 1
1             | 3
--------------------------

Now, I need to sort the list of projects and rank them with regard to their similarity to the project A.

For example:

A and B have 1 users in common over the 3 users and 2 instruments over the 2 instrument so their similarity ranking is (1/2 + 2/2) / 2 = 75%

A and C have no user in common but have 1 over 2 instruments so it will be (1/2)/2 = 25%

So B is more similar than be and output should be

--------------
Project | Rank
--------------
2       | 75
3       | 25

That's the first solution came to my mind...
If I did it in PHP and MySQL, it would be something like:

 for all tables as table_x
    for all projects (except A) as prj_y
         unique = (Select distinct count(items) from table_x where project is A)
         count += (Select distinct count(items) from table_x
                   where project is prj_x and items are in
                     (select distinct items from table_x where project is a)
                  )/unique

So the complexity would be O(n2) and with indexing the select also would cost O(log n) which wouldn't be affordable.

Do you have any idea to do it totally in MySQL or do it in a better and faster way?

******** More information and notes:**

  1. I'm limited to PHP and MySQL.

  2. This is just an example, in my real project the tables are more than 20 tables so the solution should have high performance.

  3. this question is the supplementary question for this one : Get the most repeated similar fields in MySQL database if yr solution can be used or applied in a way for both of them (somehow) It would be more than great. I want to multiply the value of related projects with the similarity of items to get the best option...

In conclusion, these two questions will : get the most related projects, get the similar items of all projects and find the most similar item for current project where the project is also similar to the current one! yo


Thanks for your intellectual answers, its really appreciated if you could shed some light on the situations

Community
  • 1
  • 1
Nil Null
  • 414
  • 5
  • 14
  • i dint get this calculation. A and B have 1 users in common over the 3 users and 2 instruments over the 2 instrument so their similarity ranking is (1/2 + 2/2) / 2 = 75% ... should it be (1/3+2/2)/2 =67% ? – Joe G Joseph Jul 23 '12 at 09:22
  • @JoeGJoseph over 2 users of project A – Nil Null Jul 23 '12 at 10:28

1 Answers1

0

You could do it this way:

SET @Aid = (SELECT id
            FROM Project_tbl
            WHERE Project_name = 'A');

SELECT P.id
  , (IFNULL(personel.prop, 0) +
     IFNULL(instrument.prop, 0)
    )/2*100 Rank
  , personel.prop AS personell
  , instrument.prop AS instrument
FROM Project_tbl P
LEFT JOIN
  ( SELECT B.Project_id pid, COUNT(*)/C.ref prop
    FROM personel_project_tbl A,
         personel_project_tbl B,
         (SELECT COUNT(*) AS ref
          FROM personel_project_tbl
          WHERE Project_id = @Aid
         ) AS C
    WHERE A.user_id = B.user_id
    AND A.Project_id = @Aid
    GROUP BY B.Project_id
  ) personel ON P.id = personel.pid
LEFT JOIN
  ( SELECT B.Project_id pid, COUNT(*)/C.ref prop
    FROM instrument_project_tbl A,
        instrument_project_tbl B,
         (SELECT COUNT(*) AS ref
          FROM instrument_project_tbl
          WHERE Project_id = @Aid
         ) AS C
    WHERE A.instrument_id = B.instrument_id
    AND A.Project_id = @Aid
    GROUP BY B.Project_id
  ) instrument ON P.id = instrument.pid
WHERE P.id <> @Aid
ORDER BY Rank DESC

The idea is to have one subquery for each table, and each of these subqueries maps project id to correspondence ratio for a given table.

I'm saying nothing at all about performance. You'll have to try and see whether it is fast enough for your needs, but as I see it there is no way to beat the O(n2) complexity you mention, as you have to inspect all the data.

MvG
  • 57,380
  • 22
  • 148
  • 276
  • Thanks bro, that's awesome. I could run it for a sample db and works fine, need to check with my actual db and will go back to u. Anyway, can u check my other question is well? this answer needs to be used for http://stackoverflow.com/questions/11538409/get-the-most-repeated-similar-fields-in-mysql-database... here I found the most similar projects and there I find the appropriate items regarding to rank of projects relations. Let me know if u need more details – Nil Null Jul 26 '12 at 03:15