I've got two tables (topics and terms) with a third table for my many-to-many relations between the two entities.
Each relation, called a bagging, has a source (text) and a weight (int between 0 and 100). A same pair (topic-term) can have several bagging (difference sources) and each has a different weight.
Now, when I query a topic to find out what are its best terms (more weight), I ideally want unique values with a calculated weight:
- a weight of 100 means this item is at the maximum
- several weight for the same pair (different sources) weigh more than single pairs
- there is no 'minus' weight
Here is the database schema:
| TOPIC
+-------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| label | varchar(255) | NO | UNI | NULL | |
| wtext | varchar(40) | YES | | NULL | |
+-------+------------------+------+-----+---------+----------------+
| TERM
+-------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| label | varchar(255) | NO | UNI | NULL | |
| slug | varchar(255) | NO | | NULL | |
+-------+---------------------+------+-----+---------+----------------+
| BAGGING
+----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| topic_id | int(11) unsigned | NO | MUL | NULL | |
| term_id | bigint(11) unsigned | NO | MUL | NULL | |
| weight | tinyint(1) unsigned | NO | | NULL | |
| source | varchar(8) | YES | | GEN | |
+----------+---------------------+------+-----+---------+----------------+
And this is my simple query:
SELECT
bagging.topic_id as topic_id,
topic.label as topic_label,
bagging.term_id as term_id,
term.label as term_label,
bagging.weight as weight,
bagging.source as source
FROM
bagging
JOIN term ON term.id = bagging.term_id
JOIN topic ON topic.id = bagging.topic_id
WHERE
bagging.topic_id = ( SELECT id FROM topic WHERE label = 'Altruism' )
ORDER BY
bagging.weight DESC
Which gives me the below results:
+----------+-------------+---------+-----------------------+--------+--------+
| topic_id | topic_label | term_id | term_label | weight | source |
+----------+-------------+---------+-----------------------+--------+--------+
| 8 | Altruism | 83 | Altruism | 100 | TOPIC |
+----------+-------------+---------+-----------------------+--------+--------+
| 8 | Altruism | 100 | Altruism (philosophy) | 95 | WPRD |
| 8 | Altruism | 100 | Altruism (philosophy) | 95 | MAN |
| 8 | Altruism | 84 | Truist | 95 | MAN |
| 8 | Altruism | 84 | Truist | 15 | WPRD |
+----------+-------------+---------+-----------------------+--------+--------+
| 8 | Altruism | 94 | Selfless action | 95 | WPRD |
| 8 | Altruism | 95 | Alturism | 95 | WPRD |
| 8 | Altruism | 96 | Digital altruism | 95 | WPRD |
| 8 | Altruism | 97 | Selflessly | 95 | WPRD |
| 8 | Altruism | 98 | Altruistical | 95 | WPRD |
| 8 | Altruism | 99 | Law of mutual aid | 95 | WPRD |
| 8 | Altruism | 101 | Altruistically | 95 | WPRD |
| 8 | Altruism | 85 | Altruistic | 95 | WPRD |
| 8 | Altruism | 86 | Altruist | 95 | WPRD |
| 8 | Altruism | 87 | Otherism | 95 | WPRD |
| 8 | Altruism | 88 | Unselfishness | 95 | WPRD |
| 8 | Altruism | 89 | Altruistic behavior | 95 | WPRD |
| 8 | Altruism | 90 | Altutrists | 95 | WPRD |
| 8 | Altruism | 91 | Altruists | 95 | WPRD |
| 8 | Altruism | 102 | Pathological altruism | 95 | WPRD |
+----------+-------------+---------+-----------------------+--------+--------+
Now, how to create a scoring function that would take the below into consideration for this particular example:
Altruism
is unbeatable, can only be equalled ( = 100 )Truist
should obviously be penalized by the15
/100 weight, yet the fact that there are two should also be factored, especially because the second is95
Altruist (Philosophy)
should weigh more than all the others (exceptAltruism
than can only be equaled.) even if the 95 twice looks bigger than 100.
The end result doesn't have to be scaled from 1 to 100, it can be a relative or abstract rating that takes into account these constraints.
I tried by calculating, for each row, ( term_sum_weight * 100 / topic_weight_sum_of_all_terms )
but see the below results, they aren't weighed enough.
How can I give a lot more weight for 95 to 96, than 1 to 20, for example?
The formula matters more than the language that will be used... MySQL or in Python/PHP in my programs.
EXPECTED RESULT (along those lines...)
+----------+-------------+---------+-----------------------+-------+--------+
| topic_id | topic_label | term_id | term_label | score | source |
+----------+-------------+---------+-----------------------+-------+--------+
| 8 | Altruism | 83 | Altruism | 1 | TOPIC |
+----------+-------------+---------+-----------------------+-------+--------+
| 8 | Altruism | 100 | Altruism (philosophy) | 0.98 | WPRD |
| 8 | Altruism | 84 | Truist | 0.96 | MAN |
+----------+-------------+---------+-----------------------+--------+-------+
| 8 | Altruism | 94 | Selfless action | 0.95 | MAN |
| 8 | Altruism | 95 | Alturism | 0.95 | MAN |
| 8 | Altruism | 96 | Digital altruism | 0.95 | MAN |
...........
| 8 | Altruism | 97 | Selflessly | 0.95 | MAN |
| 8 | Altruism | 90 | Altutrists | 0.95 | MAN |
| 8 | Altruism | 91 | Altruists | 0.95 | MAN |
| 8 | Altruism | 102 | Pathological altruism | 0.95 | MAN |
+----------+-------------+---------+-----------------------+--------+-------+