
According to this question, the answer there is correct and made the queries faster, but it does not solve the whole problem.

CREATE TABLE `USERS` (
 `ID` char(255) COLLATE utf8_unicode_ci NOT NULL,
 `NAME` char(255) COLLATE utf8_unicode_ci NOT NULL,
 PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

There are only 5 rows inside the USERS table.

ID NAME
C9XzpOxWtuh893z1GFB2sD4BIko2 ...
I2I7CZParyMatRKnf8NiByujQ0F3 ...
EJ12BBKcjAr2I0h0TxKvP7uuHtEg ...
VgqUQRn3W6FWAutAnHRg2K3RTvVL ...
M7jwwsuUE156P5J9IAclIkeS4p3L ...

CREATE TABLE `VISITS` (
 `USER_ID` char(255) COLLATE utf8_unicode_ci NOT NULL,
 `VISITED_IN` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
 KEY `USER_ID` (`USER_ID`,`VISITED_IN`),
 CONSTRAINT `VISITS_ibfk_1` FOREIGN KEY (`USER_ID`) REFERENCES `USERS` (`ID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

The indexes inside the VISITS table:

Keyname  Type   Unique  Packed  Column      Cardinality  Collation  Null  Comment
USER_ID  BTREE  No      No      USER_ID     3245         A          No
                                VISITED_IN  5283396      A          No

There are 5,740,266 rows inside the VISITS table:

C9XzpOxWtuh893z1GFB2sD4BIko2 = 4,359,264 profile visits
I2I7CZParyMatRKnf8NiByujQ0F3 = 1,237,286 profile visits
EJ12BBKcjAr2I0h0TxKvP7uuHtEg = 143,716 profile visits
VgqUQRn3W6FWAutAnHRg2K3RTvVL = 0 profile visits
M7jwwsuUE156P5J9IAclIkeS4p3L = 0 profile visits

The time taken by the queries (the seconds will change according to the number of rows):

SELECT COUNT(*) FROM VISITS WHERE USER_ID = 'C9XzpOxWtuh893z1GFB2sD4BIko2'
  • Before applying Rick James' answer, the query took between 90 and 105 seconds
  • After applying Rick James' answer, the query took between 55 and 65 seconds
SELECT COUNT(*) FROM VISITS WHERE USER_ID = 'I2I7CZParyMatRKnf8NiByujQ0F3'
  • Before applying Rick James' answer, the query took between 90 and 105 seconds
  • After applying Rick James' answer, the query took between 20 and 30 seconds
SELECT COUNT(*) FROM VISITS WHERE USER_ID = 'EJ12BBKcjAr2I0h0TxKvP7uuHtEg'
  • Before applying Rick James' answer, the query took between 90 and 105 seconds
  • After applying Rick James' answer, the query took between 4 and 8 seconds
SELECT COUNT(*) FROM VISITS WHERE USER_ID = 'VgqUQRn3W6FWAutAnHRg2K3RTvVL'
  • Before applying Rick James' answer, the query took between 90 and 105 seconds
  • After applying Rick James' answer, the query took between 1 and 3 seconds
SELECT COUNT(*) FROM VISITS WHERE USER_ID = 'M7jwwsuUE156P5J9IAclIkeS4p3L'
  • Before applying Rick James' answer, the query took between 90 and 105 seconds
  • After applying Rick James' answer, the query took between 1 and 3 seconds

As you can see, before applying the index it took between 90 and 105 seconds to count the visits of a specific user, even if the user had only a few rows (visits).

After applying the index things became better, but the problem is:

  1. If I visit the C9XzpOxWtuh893z1GFB2sD4BIko2 profile, it takes between 55 and 65 seconds to get the profile visits.
  2. If I visit the I2I7CZParyMatRKnf8NiByujQ0F3 profile, it takes between 20 and 30 seconds to get the profile visits.
  3. Etc.

A user who has only a few rows (visits) is lucky, because his profile loads faster.

I could ignore everything above and create a column inside the USERS table to count the user's visits, incrementing it when catching a new visit instead of creating millions of rows, but that will not work for me, because I allow the user to filter the visits like this:

Last 60 minutes
Last 24 hours
Last 7 days
Last 30 days
Last 6 months
Last 12 months
All-time
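
Each of these filters is essentially a time-bounded count over VISITS, so a single pre-computed counter per user is not enough. For example, "Last 7 days" is a query of this shape:

SELECT COUNT(*)
FROM VISITS
WHERE USER_ID = 'C9XzpOxWtuh893z1GFB2sD4BIko2'   -- any user ID
  AND VISITED_IN >= NOW() - INTERVAL 7 DAY;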

What should I do?

  • How about adding some [Index](https://www.liquidweb.com/kb/mysql-optimization-how-to-leverage-mysql-database-indexing/#:~:text=Indexing%20is%20a%20powerful%20structure,be%20used%20to%20enforce%20uniqueness.)? – Asgar May 29 '22 at 04:45
  • [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) [Why are images of text, code and mathematical expressions discouraged?](https://meta.stackexchange.com/q/320052/266284) – philipxy May 29 '22 at 04:57
  • @Asgar I used an index like the one in your link and it helped me, but it did not solve the whole problem. – Taha Sami May 29 '22 at 04:57
  • @philipxy Looks good now? – Taha Sami May 29 '22 at 05:10
  • Table's good, but if there's anything at that Q&A link needed to understand this post, it should be in this post with explanation of why. PS If a table is input data for a [mre]--obligatory for debug questions but extremely helpful for all code questions--it should be initialization code in a text block in columns. (For large tables give algorithmic initialization.) (Consider the latter here if you want people to experiment.) [ask] [Help] – philipxy May 29 '22 at 05:19
  • [How to Optimize Queries in a Database - The Basics](https://stackoverflow.com/q/3191623/3404097) [What are your most common sql optimizations?](https://stackoverflow.com/q/1332778/3404097) [When and why are database joins expensive?](https://stackoverflow.com/q/173726/3404097) [Re SQL "performance".](https://stackoverflow.com/a/24196511/3404097) [Asking query performance questions](https://dba.meta.stackexchange.com/q/3034/43932) [Tips for asking a good SQL question](https://meta.stackoverflow.com/a/271056/3404097) – philipxy May 29 '22 at 05:22
  • What exactly is the index? [mre] Your 1st sentence doesn't make sense, please edit to clarify. Again: Explain what the link has to do with this question. – philipxy May 29 '22 at 06:11
  • The related question means I applied that answer and it works well, but it does not solve the whole problem. – Taha Sami May 29 '22 at 06:16

3 Answers


The problem is that you are evaluating, and continually re-evaluating, very large row counts that are actually part of history and can never change. You cannot count these rows every time, because that takes too long. You want to provide counts for:

Last 60 minutes
Last 24 hours
Last 7 days
Last 30 days
Last six months
All-time

You need four tables:

Table 1: A small, fast table holding the records of visits today and yesterday

Table 2: An even smaller, very fast table holding per-user counts for the period 'day before yesterday (D2) to D7' (field 'D2toD7'), and likewise the fields 'D8toD30', 'D31toD183', and 'D184andEarlier'

Table 3: A table holding the visit counts for each user on each day

Table 4: The very large and slow table you already have, with each visit logged against a timestamp
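
In SQL, Tables 1 to 3 might look something like this (the table names and column types here are illustrative assumptions; Table 4 is the existing VISITS table):

CREATE TABLE VISITS_RECENT (            -- Table 1: today's and yesterday's visits
  USER_ID    CHAR(255) NOT NULL,
  VISITED_IN DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY (USER_ID, VISITED_IN)
);

CREATE TABLE PERIOD_COUNTS (            -- Table 2: one row of period counts per user
  USER_ID        CHAR(255) NOT NULL PRIMARY KEY,
  D2toD7         INT UNSIGNED NOT NULL DEFAULT 0,
  D8toD30        INT UNSIGNED NOT NULL DEFAULT 0,
  D31toD183      INT UNSIGNED NOT NULL DEFAULT 0,
  D184andEarlier INT UNSIGNED NOT NULL DEFAULT 0
);

CREATE TABLE DAILY_COUNTS (             -- Table 3: visits per user per day
  USER_ID     CHAR(255) NOT NULL,
  VISIT_DATE  DATE NOT NULL,
  VISIT_COUNT INT UNSIGNED NOT NULL,
  PRIMARY KEY (USER_ID, VISIT_DATE)
);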

You can then get the 'Last 60 minutes' and 'Last 24 hours' counts by doing a direct query on Table 1, which will be very fast. 'Last 7 days' is the count of all records in Table 1 (for your user) plus the D2toD7 value (for your user) in Table 2. 'Last 30 days' is the count of all records in Table 1 plus D2toD7 plus D8toD30. 'Last six months' is Table 1 plus D2toD7 plus D8toD30 plus D31toD183. 'All-time' is Table 1 plus D2toD7 plus D8toD30 plus D31toD183 plus D184andEarlier.
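
For instance, 'Last 7 days' for one user, using the sketch tables above (@uid is a placeholder for the user's ID):

SELECT (SELECT COUNT(*) FROM VISITS_RECENT WHERE USER_ID = @uid)
     + COALESCE((SELECT D2toD7 FROM PERIOD_COUNTS WHERE USER_ID = @uid), 0)
       AS last_7_days;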

I'd run PHP scripts to retrieve these values; there's no need to try to do it all in one complex query. A few, even several, very quick hits on the database collect up the numbers and return the result. The script will run in very much less than one second.

So, how do you keep the counts in Table 2 updated? This is where you need Table 3, which holds the counts of visits by each user on each day. Create Table 3 and populate it with COUNT values from your enormous table of all visits, GROUP BY user and date, so you have the number of visits by each user on each day. You only need to create and populate Table 3 once. You then need a CRON job/script, or similar, to run once a day (a SQL sketch follows the list below). This script needs to:

  1. Identify the counts of visits for each user the day before yesterday.
  2. Insert those counts in Table 3 with the 'day before yesterday' date.
  3. Add the count values to the 'D2toD7' values for each user in Table 2.
  4. Delete the 'day before yesterday' rows from Table 1.
  5. Look up the value for (what just became) D8 for each user in Table 3, and subtract this value from the 'D2toD7' value for each user.
  6. For each of the 'D8toD30', 'D31toD183', etc. fields, increment for the day that is now part of the time period and decrement for the day that drops out of it, using the values stored in Table 3.
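
A sketch of that daily script in SQL, using the illustrative table names from above (the D31 and D184 boundary shifts follow the same pattern as the D8 one shown):

-- Run shortly after midnight. Steps 1 and 2: count the day that just
-- became "day before yesterday" and record it in Table 3.
INSERT INTO DAILY_COUNTS (USER_ID, VISIT_DATE, VISIT_COUNT)
SELECT USER_ID, CURDATE() - INTERVAL 2 DAY, COUNT(*)
FROM VISITS_RECENT
WHERE VISITED_IN >= CURDATE() - INTERVAL 2 DAY
  AND VISITED_IN <  CURDATE() - INTERVAL 1 DAY
GROUP BY USER_ID;

-- Step 3: add those counts to D2toD7 in Table 2.
UPDATE PERIOD_COUNTS p
JOIN DAILY_COUNTS d
  ON d.USER_ID = p.USER_ID AND d.VISIT_DATE = CURDATE() - INTERVAL 2 DAY
SET p.D2toD7 = p.D2toD7 + d.VISIT_COUNT;

-- Step 4: delete the day-before-yesterday rows from Table 1.
DELETE FROM VISITS_RECENT WHERE VISITED_IN < CURDATE() - INTERVAL 1 DAY;

-- Steps 5 and 6: the day that just became D8 moves from D2toD7 to D8toD30.
UPDATE PERIOD_COUNTS p
JOIN DAILY_COUNTS d
  ON d.USER_ID = p.USER_ID AND d.VISIT_DATE = CURDATE() - INTERVAL 8 DAY
SET p.D2toD7  = p.D2toD7  - d.VISIT_COUNT,
    p.D8toD30 = p.D8toD30 + d.VISIT_COUNT;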

Remember to keep a sense of proportion; a period of 183 days approximates to six months well enough for any real-world visit counting purpose.

Overview: you cannot count millions of rows quickly. Use the fact that these are historical figures that will never change. Because you have Table 1 for the up-to-the-minute counts, you only need to update the historic period counts once a day. Multiple (even dozens of) very, very fast queries will get you accurate results very quickly.

Geoff Kendall

This may not be the answer, but here is a suggestion.

  1. If they do not require real-time data, can't we run a scheduler and insert these into a summary table every x minutes? Then we can access that summary table for your count.

Note: We can add a sync-time column to your table if you need a time-wise login count. (Then your summary table also grows dynamically.)

Example table columns:

PK_Column, user ID, Numb of visit, sync_time

  2. We can use an asynchronous (reactive) implementation for your front end. That means the data will load after some time, but the user will never experience that delay in his work.

  3. Create a summary table and, every day at 12:00 AM, run a job that puts the user-wise and date-wise last-visited summary into that table.

user_visit_Summary Table: PK_Column, User ID, Number_of_Visites, VISIT_Date

Note: Create indexes for User ID and the Date fields
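
A possible DDL for that table (column types are assumptions; the column names follow the answer):

CREATE TABLE user_visit_Summary (
  PK_Column         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  User_ID           CHAR(255) NOT NULL,
  Number_of_Visites INT UNSIGNED NOT NULL,
  VISIT_Date        DATE NOT NULL,
  KEY (User_ID),     -- index on the user field, as noted above
  KEY (VISIT_Date)   -- index on the date field
);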

When you're retrieving the data, you can access it with a query like:

SELECT COUNT(*)
       + (SELECT Number_of_Visites
          FROM user_visit_Summary
          WHERE User_ID = 'xxx'
            AND VISIT_Date <= 'DATE 12:00 AM'   -- the last summarized date
          ORDER BY PK_Column DESC
          LIMIT 1) AS old_visits
FROM VISITS
WHERE USER_ID = 'xxx'
  AND VISITED_IN > 'DATE 12:00 AM';
isatsara
  • Please look at the formatted version of your post before you publish. Please delete & flag obsolete comments. [Help] – philipxy May 29 '22 at 06:12
  • Okay, it's better to make profile visits show in real time, but because there are no solutions, I'll do it without real time as you said, and I will tell the users the profile visits will be refreshed every 12 hours, for example. – Taha Sami May 29 '22 at 06:20
  • But where should I store the filters for every user? (Last 60 minutes, Last 24 hours, etc...) What about storing them inside the USERS table as a JSON type? – Taha Sami May 29 '22 at 06:20
  • JSON type is also a good solution. If you can manage a NoSQL-type DB, then direct-write it to that; use an AOP or RabbitMQ-type solution for that (if you are afraid about side effects during data insertion). – isatsara May 29 '22 at 06:26
  • "But where should I store the filters for every user? (Last 60 minutes, Last 24 hours, etc...)" for this one, last 60, last 30 min, you have to access your VISIT table. if it is more than 24 hours of data, then access the summary table. For total (Summery + VISIT data). check my 3rd solution – isatsara May 29 '22 at 06:30
  • Okay, thank you. I'll bounty you 50 points after 2 days. – Taha Sami May 29 '22 at 06:39
  • `SELECT COUNT((SELECT USER_ID FROM VISITS WHERE VISITED_IN >= DATE_SUB(NOW(), INTERVAL 60 MINUTE))) AS LAST_60_MINUTES, COUNT((SELECT USER_ID FROM VISITS WHERE VISITED_IN >= DATE_SUB(NOW(), INTERVAL 24 HOUR))) AS LAST_24_HOURS FROM VISITS WHERE USER_ID = 'I2I7CZParyMatRKnf8NiByujQ0F3'` That query confuses me; I'm trying to get the rows in the last 60 minutes and 24 hours. What did I do wrong? – Taha Sami May 29 '22 at 11:04
  • Are you going to create your summary table or what with this query? Can you elaborate more on your query requirement and your plan, so it is easy for us to help you? Also, please tell me all the time filters that you provide for your users. – isatsara May 29 '22 at 11:42
  • [Look at my question](https://stackoverflow.com/questions/72424048/how-can-i-count-the-visits-according-to-the-date) – Taha Sami May 29 '22 at 13:29
  • I'll try a specific scenario and test the performance. If it's bad, I'll apply your scenario. You can find the current scenario in the tail of the [question](https://stackoverflow.com/questions/72389408/why-do-the-queries-take-the-same-time-to-get-data-even-though-the-number-of-rows). Thank you, Satsara, for your patience. – Taha Sami May 29 '22 at 13:33
  • @luq891 - That SELECT is quite inefficient the way it is written. I'll have a revision in my Answer soon. – Rick James May 29 '22 at 19:33
  • The system says: You may award your bounty worth 50 reputation in 23 hours. – Taha Sami May 31 '22 at 08:05

For any query of a day or longer, use a Summary table.

That is, build and maintain a Summary table with 3 columns: user_id, date, count; PRIMARY KEY(user_id, date). For "all time" and "last month", the queries will be:

SELECT SUM(count) FROM summary WHERE user_id = '...';
SELECT SUM(count) FROM summary
    WHERE user_id = '...'
      AND date >= CURDATE() - INTERVAL 1 MONTH;
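
For concreteness, the summary table might be declared like this (column types are assumptions):

CREATE TABLE summary (
  user_id CHAR(255) NOT NULL,
  date    DATE NOT NULL,
  count   INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, date)
);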

At midnight each night, roll your current table up into one row per user in the summary table, then clear the current table. The current table will continue to be used for shorter timespans.
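
A sketch of that nightly rollup, using `recent` as the name of the current (small) table, as in the queries below (it is assumed to have the same columns as VISITS):

-- Summarize the day that just ended into one row per user...
INSERT INTO summary (user_id, date, count)
SELECT USER_ID, DATE(VISITED_IN), COUNT(*)
FROM recent
WHERE VISITED_IN >= CURDATE() - INTERVAL 1 DAY
  AND VISITED_IN <  CURDATE()
GROUP BY USER_ID, DATE(VISITED_IN);

-- ...then clear those rows (but see the 24-hour compromise below
-- before deleting anything newer).
DELETE FROM recent WHERE VISITED_IN < CURDATE();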

This achieves speed for every user for every time range.

But there is a "bug": I am forcing "day"/"week"/etc. to be midnight to midnight, and not allowing you to really say "the past 24 hours".

I suggest the following compromise for that "bug":

  • For long timespans, use the summary table, plus count today's hits from the other table.
  • For allowing "24 hours" to reach into yesterday, change the other table to reach back to yesterday morning. That is, purge only after 24 hours, not 1 calendar day.
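
Putting the compromise together, a sketch (@uid is a placeholder):

-- "Last 30 days" = summarized full days plus today's not-yet-summarized hits
SELECT COALESCE(( SELECT SUM(count) FROM summary
                  WHERE user_id = @uid
                    AND date >= CURDATE() - INTERVAL 30 DAY ), 0)
     + ( SELECT COUNT(*) FROM recent
         WHERE USER_ID = @uid AND VISITED_IN >= CURDATE() ) AS last_30_days;

-- Purge on a rolling 24-hour window rather than at the day boundary
DELETE FROM recent WHERE VISITED_IN < NOW() - INTERVAL 24 HOUR;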

To fetch all counters at once, do all the work in subqueries. There are two approaches, probably equally fast, but the result is either in rows or columns:

-- rows:
SELECT 'hour', COUNT(*) FROM recent ...
UNION ALL
SELECT '24 hr', COUNT(*) FROM recent ...
UNION ALL
SELECT 'month', SUM(count) FROM summary ...
UNION ALL
SELECT 'all', SUM(count) FROM summary ...
;

-- columns:
SELECT
    ( SELECT COUNT(*) FROM recent ... ) AS 'hour',
    ( SELECT COUNT(*) FROM recent ... ) AS '24 hr',
    ( SELECT SUM(count) FROM summary ... ) AS 'last month',
    ( SELECT SUM(count) FROM summary ... ) AS 'all time'
;

The "..." is

WHERE user_id = '...'
  AND datetime >= ...  -- except for "all time"

There is an advantage in rolling the several queries into a single query (either way): it avoids multiple round trips to the server and multiple invocations of the Optimizer.

forpas provided another approach (https://stackoverflow.com/a/72424133/1766831), but it needs to be adjusted to reach into two different tables.

Rick James
  • Good evening Rick, your scenario looks good; I'll understand it and apply it soon. I'll come back to your answer and test it when I finish 90% of the project, because I'm a little bit behind and need to develop the app faster. I don't know how I can thank you; thank you for the effort and time you put in. I'm very tired and need to sleep. Good night, Rick. – Taha Sami May 29 '22 at 20:26