
I have 2 MySQL tables with the following structure:

**tblLocations**
ID [primary key]
CITY [non-unique varchar]
NAME [non-unique varchar]
----------------------------------
**tblPopularNames**
ID [primary key]
CITY [unique varchar]
POPULARNAME [non-unique varchar]

I am receiving input from users through a web form, and PHP code then inserts the data into tblLocations. This part is simple. Now, every time an insert is made to tblLocations, I need to trigger the following actions:

  1. See if tblPopularNames contains an entry for the inserted CITY value
  2. If the entry exists, update the corresponding POPULARNAME field with the highest-frequency NAME value against the CITY field in tblLocations.
  3. If the entry doesn't exist, make one with the values just entered.
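For reference, a minimal DDL sketch matching that description (the column types and lengths are assumptions; the question only specifies the key and uniqueness properties):

```sql
-- Assumed DDL; only the key/uniqueness properties come from the question.
CREATE TABLE tblLocations (
    ID   INT AUTO_INCREMENT PRIMARY KEY,
    CITY VARCHAR(100) NOT NULL,   -- non-unique
    NAME VARCHAR(100) NOT NULL    -- non-unique
);

CREATE TABLE tblPopularNames (
    ID          INT AUTO_INCREMENT PRIMARY KEY,
    CITY        VARCHAR(100) NOT NULL UNIQUE,
    POPULARNAME VARCHAR(100) NOT NULL
);
```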

Can this be done without using any query nesting? What would be the least expensive way to perform this action in terms of memory usage?

I can see a related post here, but the answers there only provide the maximum count of the values being sought, which isn't what I'm after. I need the least contrived way of accomplishing the two tasks. Also, I don't know exactly how the query would handle ties, i.e., two names enjoying the same frequency for the city entered. I honestly don't mind the query returning either value in such a scenario, as long as it doesn't throw an error.

I hope I have explained it as clearly as needed but should you have any doubts, feel free to comment.

P.S. Not sure if the question belongs here or over at DBA. I chose to go with SO because I saw other questions pertaining to queries on this site (e.g., this one). If one of the moderators feels DBA would be a better fit, I'd request them to move it as they deem appropriate.

TheLearner
  • Do you need to update the popularName field for the newly inserted tblLocations row only, or for all the tblLocations rows with the same insertedCity? – LSerni May 05 '15 at 21:34
  • Only for the newly inserted CITY in tblLocation. – TheLearner May 05 '15 at 21:36
  • The moment a new insert is made to tblEntry, it should update tblPopularNames for the entry corresponding to that CITY (or create one should one not exist already) with the NAME value with the highest occurrence against the CITY in question. I intend to run this query as a trigger. – TheLearner May 05 '15 at 21:38
  • Are you using php to call the sql? – Phil May 05 '15 at 21:45
  • Well, the first insert to tblLocation is being called by PHP but the query I am asking for here that inserts to tblPopularNames will be a trigger within the MySQL database. – TheLearner May 05 '15 at 21:47
  • 1
    It depends on your application, but it is important to understand that by using triggers you will catch 100% of the inserts, but you lose an element of control which may ultimately slow down your database in the future. – Phil May 05 '15 at 21:50
  • Wow, is it? I was guessing triggers would be quicker than multiple PHP queries – clearly I was wrong. :( What alternative do you suggest? – TheLearner May 05 '15 at 21:52
  • 1
    I have made a theoretical (code less) answer to explain my point. I wont deny that triggers can be really useful things, but I would personally prefer to put it at code level regardless of the situation. – Phil May 05 '15 at 21:58

2 Answers


The first table accepts two values from users: their name and the city they live in. The fields affected in that table are CITY and NAME. Then each time a new entry is made to this table, another is made to tblPopularNames with that city and the name that occurs most frequently against that city in tblLocations. For example, if John is the most popular name in NY, tblPopularNames gets updated with NY, John.

Okay, so let's break this up into a trigger. *Each time a new entry is made* translates to AFTER INSERT ON tblLocations FOR EACH ROW; *the name that occurs most frequently against that city in tblLocations* means we run a SELECT NEW.insertedCity, old.insertedName FROM tblLocations AS old WHERE insertedCity = NEW.insertedCity GROUP BY insertedName ORDER BY COUNT(*) DESC LIMIT 1; and we may want to add a tiebreaker to that ORDER BY so that several names at equal frequency don't get extracted at random.

There's an additional requirement, that if the city already exists in tblPopularNames the entry be updated. We need a UNIQUE KEY on tblPopularNames.popularCity for that; it will allow us to use ON DUPLICATE KEY UPDATE.
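If popularCity isn't already declared UNIQUE, an index along these lines would enable the upsert (the index name here is an arbitrary choice):

```sql
-- Adds the unique index that ON DUPLICATE KEY UPDATE relies on.
ALTER TABLE tblPopularNames
    ADD UNIQUE KEY uq_popularCity (popularCity);
```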

And finally:

DELIMITER //
CREATE TRIGGER setPopularName
    AFTER INSERT ON tblLocations
    FOR EACH ROW BEGIN
        INSERT INTO tblPopularNames 
        SELECT NEW.insertedCity, insertedName 
            FROM tblLocations
            WHERE insertedCity = NEW.insertedCity
            GROUP BY insertedName
            ORDER BY COUNT(*) DESC, insertedName
            LIMIT 1 
        ON DUPLICATE KEY
            UPDATE popularName = VALUES(popularName)
        ;
    END;//
DELIMITER ;

Test

mysql> INSERT INTO tblLocations VALUES ('Paris', 'Jean'), ('Paris', 'Pierre'), ('Paris', 'Jacques'), ('Paris', 'Jean'), ('Paris', 'Etienne');
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM tblPopularNames;
+-------------+-------------+
| popularCity | popularName |
+-------------+-------------+
| Paris       | Jean        |
+-------------+-------------+
1 row in set (0.00 sec)

mysql> INSERT INTO tblLocations VALUES ('Paris', 'Jacques'), ('Paris', 'Jacques'), ('Paris', 'Etienne');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM tblPopularNames;
+-------------+-------------+
| popularCity | popularName |
+-------------+-------------+
| Paris       | Jacques     |
+-------------+-------------+
1 row in set (0.00 sec)

Triggers vs. code

There's no denying @Phil_1984's answer has a great deal of merit. Triggers have their uses, but they aren't a silver bullet.

Moreover, at this stage it's possible that the design is still too early in its lifecycle for it to be worth the hassle of outsourcing the hard work to a trigger. What if, for example, you decide to go with the "counter" solution hinted at above? Or what if you decide to complicate the choice of the popularName?

There's little doubt that maintaining (which includes thoroughly field-testing) a trigger is much more expensive than the same thing done in code.

So what I'd really do is first to design a function or method with the purpose of receiving the insertedValues and doing some magic.

Then I'd emulate the trigger code with several queries in PHP, wrapped in a transaction. They would be the same queries that appear in the trigger, above.

Then I'd go on with the rest of the work, safe in the knowledge that this solution is working, if perhaps amenable to performance improvement.

If, much later, the design is convincing and gets committed, it will be very easy to modify the function to only run one INSERT query and avail itself of a trigger - that one, or the slightly modified one that has evolved in the meantime.

If the slightly modified version has been taken over by creeping featurism and is not easily backported to a trigger, you need do nothing and have lost nothing. Otherwise you've lost only the time for the initial implementation (very little) and are now ready to profit.

So my answer would be: both :-)

Slightly different use case (per comments)

The thing is, the first query being performed by PHP is an indefinitely large one with potentially hundreds of entries being inserted at once. And I do need to update the second table every time a new entry is made to the first because by its very nature, the most popular name for a city can potentially change with every new entry, right? That's why I was considering a trigger since otherwise PHP would have to fire hundreds of queries simultaneously. What do you think?

The thing is: what should happen between the first and the last INSERT of that large batch?

Are you using the popular name in that cycle?

If yes, then you have little choice: you need to examine the popularity table after each insert (not really; there's a workaround, if you're interested...).

If no, then you can do all the calculations at the end.

I.e., you have a long list of

 NY        John
 Berlin    Gottfried
 Roma      Mario
 Paris     Jean
 Berlin    Lukas
 NY        Peter
 Berlin    Eckhart

You can retrieve all the popular names (or all the popular names with cities in the list you're inserting) together with their frequency, and have them in an array of arrays:

 [
     [ NY,        John,    115 ],
     [ NY,        Alfred,  112 ],
     ...
 ]

Then from your list you "distill" the frequencies:

 NY        John       1
 NY        Peter      1
 Berlin    Gottfried  1
 Roma      Mario      1
 Paris     Jean       1
 Berlin    Lukas      1
 Berlin    Eckhart    1

and you add (you're still in PHP) the frequencies to the ones you retrieved. In this case for example NY,John would go from 115 to 116.

You can do both at the same time, by first getting the "distilled" frequency of the new inserts and then running a query:

 while ($tuple = $exec->fetch()) {
     // $tuple is [ NY, John, 115 ]
     // Is there a [ NY, John ] in our distilled array?
     $found = array_filter($distilled, function($item) use ($tuple) {
         return (($item[0] === $tuple[0]) && ($item[1] === $tuple[1]));
     });
     if (empty($found)) {
         // This is probably an error: the outer search returned Rome,
         // yet there is no Rome in the distilled values. So how come
         // we included Rome in the outer search?
         continue;
         // But if the outer search had no WHERE, it's OK; just continue.
     }
     $datum = array_pop($found);
     // if (!empty($found)) { another error: there should be only one. }

     // So we have New York with popular name John and frequency 115
     $tuple[2] += $datum[2];
     $newFrequency[] = $tuple;
 }

You can then sort the array by city and frequency descending using e.g. uasort.

uasort($newFrequency, function($f1, $f2) {
    if ($f1[0] < $f2[0]) return -1;
    if ($f1[0] > $f2[0]) return 1;

    return $f2[2] - $f1[2];
});

Then you loop through the array

 $popularNames = array();
 $oldCity      = null;
 foreach ($newFrequency as $row) {
     // $row = [ 'New York', 'John', 115 ]
     if ($oldCity != $row[0]) {
         // Given the sorting, this is the new maximum.
         $popularNames[] = array( $row[0], $row[1] );
         $oldCity = $row[0];
     }
 }

 // Now popularNames[] holds the new cities with the new popular name.
 // We can build a single query such as
 INSERT INTO tblPopularNames VALUES
     ( city1, name1 ),
     ( city2, name2 ),
     ...
     ( city3, name3 )
 ON DUPLICATE KEY
    UPDATE popularName = VALUES(popularName);

This would insert those cities for which there's no entry, or update the popularNames for those cities where there is.

LSerni
  • Since you speak of use-case, well this is what I am trying to do and I'm sorry if I wasn't clear enough in my question. The first table accepts two values from users: their name and the city they live in. The fields affected in that table are CITY and NAME. Then each time a new entry is made to this table, another is made to tblPopularNames with that city and the name that occurs most frequently against that city in tblLocations. For example, if John is the most popular name in NY, tblPopularNames gets updated with NY, John. – TheLearner May 05 '15 at 22:06
  • And if tblPopularNames doesn't already have an entry for NY, one should be created. – TheLearner May 05 '15 at 22:07
  • I think this example is doing something completely different from my understanding of the problem. I don't think the OP wants to alter the inserted data on the fly. – Phil May 05 '15 at 22:09
  • @AmitSchandillia, I see. I had taken Name to be a city name. I'm deleting this answer as it's useless in its present form, but I'll probably resurrect it tomorrow with amended code (the principle is exactly the same). – LSerni May 05 '15 at 22:18
  • Wow! Wish I could accept this answer a dozen times over! So exhaustive and detailed, I'm blown. Thank you so very much for taking the time. Speaking of performance and scalability, I am honestly considering throwing in a PHP hack to do what the trigger would, just as you and @Phil_1984_ suggested. On those lines, I have thought of this: What if I just add a new FREQUENCY column to tblLocations that defaults to 1 and gets incremented by 1 each time there's a duplicate on the corresponding CITY-NAME values (set as composite unique). – TheLearner May 06 '15 at 20:12
  • Then the PHP hack in question could just run a single prepared-bind to update tblPopularNames with the phone numbers (still available to PHP in the original array and the corresponding names with the highest votes (max() value in the FREQUENCY column). Think this would be more efficient? Less efficient? – TheLearner May 06 '15 at 20:13
  • @AmitSchandillia , you can use a frequencyCounter in tblLocations; just use SUM(frequencyCounter) where you would COUNT(*). No other change is required. If you also use a ON DUPLICATE KEY UPDATE frequency = frequency + 1, you need not worry about checking - just insert away and let MySQL take care of the rest. For the phone numbers, they would go together with popularName; you would use popularName, popularPhone where you now only have popularName. – LSerni May 07 '15 at 06:28
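The counter idea from the last two comments could be sketched roughly like this (the FREQUENCY column and the composite unique key on (CITY, NAME) are assumptions taken from the comments, not part of the original schema):

```sql
-- Requires: ALTER TABLE tblLocations ADD UNIQUE KEY uq_city_name (CITY, NAME);
-- Each duplicate (CITY, NAME) pair bumps the counter instead of adding a row.
INSERT INTO tblLocations (CITY, NAME, FREQUENCY)
VALUES ('NY', 'John', 1)
ON DUPLICATE KEY UPDATE FREQUENCY = FREQUENCY + 1;

-- Wherever the trigger used COUNT(*), use SUM(FREQUENCY) instead:
SELECT NAME
  FROM tblLocations
 WHERE CITY = 'NY'
 GROUP BY NAME
 ORDER BY SUM(FREQUENCY) DESC, NAME
 LIMIT 1;
```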

I believe this is a question of application logic over database logic. E.g. code vs triggers.

Since what you are really doing is a form of indexing for use specifically in your application, I would recommend that this logic live somewhere at your application level (e.g. PHP). It should be:

  • simple (I would just do two queries: a SELECT with a count, and an UPDATE)
  • easy to maintain (use a good database interface abstraction, e.g. one function)
  • only run when needed (using logic in that function)

How you approach that solution is the tricky part. For example, you might think it best to do the calculation on every insert, but that would be inefficient if you are doing a batch of inserts for the same city.
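A batch-friendly version of that could look roughly like this (using the question's column names and assuming a UNIQUE key on tblPopularNames.CITY): insert the whole batch first, then refresh the popular name once per distinct city:

```sql
-- Hypothetical post-batch refresh; run once per distinct city in the batch,
-- binding ? to that city. Requires tblPopularNames.CITY to be UNIQUE.
INSERT INTO tblPopularNames (CITY, POPULARNAME)
SELECT CITY, NAME
  FROM tblLocations
 WHERE CITY = ?
 GROUP BY NAME
 ORDER BY COUNT(*) DESC, NAME
 LIMIT 1
ON DUPLICATE KEY UPDATE POPULARNAME = VALUES(POPULARNAME);
```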

I have had a very bad experience with using triggers for everything and watching the database get slow. Granted, it was in PostgreSQL (15 years ago, before MySQL triggers existed) and on a rather large database of about 500 tables. Triggers are good because they catch 100% of inserts, but sometimes that's not what you want. You lose an element of control from the application's point of view by using triggers, and you can end up slowing down your whole database with too many of them. So that's an anti-trigger perspective. It's that loss of control which is a deal breaker for me.

Phil
  • Voted up for the anti-trigger perspective...definitely a point worth pondering over. – TheLearner May 05 '15 at 22:00
  • 1
    I do concur on this probably being better implemented at the application level. I'm much less sanguine about triggers "slowing down" the database; triggers (working *inside* the database) are consistently *faster* (and more robust) than application "add ons", except maybe in carefully contrived cases, *unless the trigger design itself is flawed* (a database *can* be brought to its knees by a trigger-happy DB admin). – LSerni May 05 '15 at 22:03
  • The thing is, the first query being performed by PHP is an indefinitely large one with potentially hundreds of entries being inserted at once. And I do need to update the second table every time a new entry is made to the first because by its very nature, the most popular name for a city can potentially change with every new entry, right? That's why I was considering a trigger since otherwise PHP would have to fire hundreds of queries simultaneously. What do you think? – TheLearner May 05 '15 at 22:25
  • 1
    That is exactly a reason to use code over triggers. The pro of application code is that it should be able to tell when there's going to be lot of inserts happening at once so it could be intelligent enough to do the inserts first, then build the popular rows afterwards. DB triggers are just dumb and would update on every insert. – Phil May 05 '15 at 22:33