13

Introduction
in my project I'm saving FacebookPages and their like count, as well as the like count per country. I have a table for the FacebookPages, one for the languages, one for the correlation between the facebook page and the language (and counting the likes) and one table which saves this data as a history. What I want to do, is to get the page with the strongest increase in likes over a specific time period.

Data to work with

I'm stripping the non relevant information from the create queries.

Table containing all facebook pages

CREATE TABLE `pages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `facebook_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `facebook_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `facebook_likes` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data:

INSERT INTO `facebook_pages` (`id`, `facebook_id`, `facebook_name`, `facebook_likes`)
VALUES
    (1, '552825254796051', 'Mesut Özil', 28593755),
    (2, '134904013188254', 'Borussia Dortmund', 13213354),
    (3, '310111039010406', 'Marco Reus', 12799627);

Table containing all languages

CREATE TABLE `languages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language` varchar(5) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data

INSERT INTO `languages` (`id`, `language`)
VALUES
    (1, 'ID'),
    (2, 'TR'),
    (3, 'BR');

Table containing the correlation

CREATE TABLE `language_page_likes` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language_id` int(10) unsigned NOT NULL,
  `facebook_page_id` int(10) unsigned NOT NULL,
  `likes` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  // Foreign key stuff
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data

INSERT INTO `language_page_likes` (`id`, `language_id`, `facebook_page_id`)
VALUES
    (1, 1, 1),
    (2, 2, 1),
    (3, 3, 1),
    (47, 3, 2),
    (51, 1, 2),
    (53, 2, 2),
    (92, 3, 3),
    (95, 2, 3),
    (97, 1, 3);

Table containing the history

CREATE TABLE `language_page_likes_history` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `language_page_likes_id` int(10) unsigned NOT NULL,
  `likes` int(11) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`),
  // Foreign key stuff
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Example data

INSERT INTO `language_page_likes_history` (`id`, `language_page_likes_id`, `likes`, `created_at`)
VALUES
    (1, 1, 3272484, '2015-09-11 08:40:23'),
    (132014, 1, 3272827, '2015-09-14 08:31:00'),
    (2, 2, 1581361, '2015-09-11 08:40:23'),
    (132015, 2, 1580392, '2015-09-14 08:31:00'),
    (3, 3, 1467090, '2015-09-11 08:40:23'),
    (132016, 3, 1467329, '2015-09-14 08:31:00'),
    (47, 47, 828736, '2015-09-11 08:40:23'),
    (132060, 47, 828971, '2015-09-14 08:31:00'),
    (51, 51, 602747, '2015-09-11 08:40:23'),
    (132064, 51, 603071, '2015-09-14 08:31:00'),
    (53, 53, 545484, '2015-09-11 08:40:23'),
    (132066, 53, 545092, '2015-09-14 08:31:00'),
    (92, 92, 916570, '2015-09-11 08:40:24'),
    (132105, 92, 917032, '2015-09-14 08:31:01'),
    (95, 95, 537382, '2015-09-11 08:40:24'),
    (132108, 95, 537395, '2015-09-14 08:31:01'),
    (97, 97, 419175, '2015-09-11 08:40:24'),
    (132110, 97, 419484, '2015-09-14 08:31:01');

As you can see, I got data for the 14th and 11th September. Now I want to get the site, with the biggest increase in likes. Before I have done it with a column called last_like_count, but the problem is, that I can't be dynamic in the date range. With a "normal" regression function I could be dynamic for every date range.

Solution finding
What I already managed to do, was to build all the relationships that are present

SELECT p.id, p.facebook_name, plh.likes, l.language FROM facebook_pages p
INNER JOIN language_page_likes pl ON pl.facebook_page_id = p.id
INNER JOIN language_page_likes_history plh ON plh.language_page_likes_id = pl.id
INNER JOIN languages l ON l.id = pl.language_id
WHERE pl.language_id = 5 OR pl.language_id = 46 OR pl.language_id = 68

With that query I'm getting every likecount in the history of the system for specific languages. But, how would I build regression analysis into that part?

I already found this link here

Identifying trend with SQL query

but my math and MySQL skills aren't high enough to translate the SQL into MySQL. Any help?

Community
  • 1
  • 1
Musterknabe
  • 5,763
  • 14
  • 61
  • 117
  • 1
    That's pretty straightforward if your RDBMS supports windowing functions (like in the link you posted), but unfortunately, MySQL lacks many features that other RDBMS have. The most sensible solution would be to switch to a RDBMS that supports windowing functions (e.g. PostgreSQL). – Frank Schmitt Sep 16 '15 at 11:07
  • 3
    Switching the RDBMS is the most sensible solution to solving a single data analysis question?? – rwking Sep 17 '15 at 13:23

2 Answers2

6

This might be, what you are looking for:

SELECT SUM((X-AVG_X)*(Y-AVG_Y)) / SUM((X-AVG_X)*(X-AVG_X)) AS Slope,
       PageId, LanguageId
FROM
(
SELECT Q0.Y, 
       Q0.X, 
       Q1.AVG_Y,
       Q1.AVG_X,
       Q1.PageId,
       Q1.LanguageId
FROM   (SELECT T0.likes AS Y,
               UNIX_TIMESTAMP(T0.created_at) AS X,
               T1.facebook_page_id AS PageId,
               T1.language_id AS LanguageId
        FROM   language_page_likes_history T0 INNER JOIN
               language_page_likes T1 ON 
               (T0.language_page_likes_id = T1.id)
        WHERE  T0.created_at > '2015-09-11 00:00:00' AND
               T0.created_at < '2015-09-15 00:00:00') Q0 INNER JOIN
       (SELECT AVG(T2.likes) AS AVG_Y,
               AVG(UNIX_TIMESTAMP(T2.created_at)) AS AVG_X,
               T3.facebook_page_id AS PageId,
               T3.language_id AS LanguageId
        FROM   language_page_likes_history T2 INNER JOIN
               language_page_likes T3 ON 
               (T2.language_page_likes_id = T3.id)
        WHERE  T2.created_at > '2015-09-11 00:00:00' AND
               T2.created_at < '2015-09-15 00:00:00'
        GROUP BY T3.facebook_page_id, T3.language_id) Q1
        ON (Q0.PageId = Q1.PageId) AND (Q0.LanguageId = Q1.LanguageId)
) Q2
GROUP BY PageId, LanguageId
ORDER BY Slope DESC

It returns the slope of the linear regression per Page and Language. The column Slope represents the number of likes per second. In your sample data the amount of likes decreases for two cases. I don't know why. The output should look like this. The SQL statement is tested and I checked two row calculations manually for correct output.

|           Slope | PageId | LanguageId |
|-----------------|--------|------------|
|  0.001786287345 |      3 |          3 |
|  0.001326183029 |      1 |          1 |
|  0.001252720995 |      2 |          1 |
|  0.001194724653 |      3 |          1 |
|  0.000924075055 |      1 |          3 |
|  0.000908609364 |      2 |          3 |
|  0.000050263497 |      3 |          2 |
| -0.001515637747 |      2 |          2 |
| -0.003746563717 |      1 |          2 |

There might be a problem, if there is no data in the tables. So maybe ISNULL-checks have to be added.


When you only want to know the absolute values it is simpler. You can take following statement:

SELECT PageId, LanguageId,
       (likes_last_in_period - likes_before_period) AS Likes
FROM
(SELECT T1.facebook_page_id AS PageId,
       T1.language_id AS LanguageId,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at < '2015-09-12 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at DESC LIMIT 1) likes_before_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at >= '2015-09-12 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at ASC LIMIT 1) likes_first_in_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at <= '2015-09-15 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at DESC LIMIT 1) likes_last_in_period,
       (SELECT likes 
        FROM   language_page_likes_history
        WHERE  created_at > '2015-09-15 00:00:00' AND
               language_page_likes_id = T1.id
        ORDER BY created_at ASC LIMIT 1) likes_after_period

        FROM   language_page_likes T1) Q0
ORDER BY Likes DESC

which has 4 sub-queries. Only two are needed which you have to choose. I have choosen to use the number of likes just before the period and the last number of likes which is in the period to calculate the difference. The result looks like this:

| PageId | LanguageId | Likes |
|--------|------------|-------|
|      3 |          3 |   462 |
|      1 |          1 |   343 |
|      2 |          1 |   324 |
|      3 |          1 |   309 |
|      1 |          3 |   239 |
|      2 |          3 |   235 |
|      3 |          2 |    13 |
|      2 |          2 |  -392 |
|      1 |          2 |  -969 |
user1027167
  • 4,320
  • 6
  • 33
  • 40
  • That's super awesome. 1 upvote is guaranteed for you. But I didn't want the likes per second, sorry if that wasn't clear. I want to see the increase of likes in a date time period. Example: How many likes has pageId 3 gotten in the last 7 days. (Decrease might be, because people of that country unliked, or switched languages?) – Musterknabe Sep 17 '15 at 07:32
  • Yes your question was not clear, you talked about regression analysis and you linked another answer which also implements a formuly for linear regression. To only get the absolute values it is much simpler. See my edits... – user1027167 Sep 17 '15 at 10:55
  • If I understood this correctly. This query now shows how many likes a page got/lost for a specific language between the two dates 12th September and 15th September, right? If yes, I'll accept your answer – Musterknabe Sep 18 '15 at 07:26
  • Yes thats right. The only thing you have to consider is that if you have no values for a long period this might not be so accurate. E.g. you have one like counter from 1st September with 100, next on 13th September with 1000, than my query (using "likes_last_in_period - likes_before_period") would say there were 900 likes between the dates 12th September and 15th September. But this will not be a big problem if you have lot of data. – user1027167 Sep 18 '15 at 07:53
  • This confuses me "a bit". Wouldn't it be correct that it said 900 likes? I mean, I have data, the one says the page got 100 likes, the next says I got 1000 likes. Isn't it logical that 900 is the result? Anyway, normally I have data for every day. – Musterknabe Sep 18 '15 at 07:56
  • For the example above the data says 900 likes from 1st to 13th september, this is the truth. When you query from 12th to 15th september without knowing the exact data and the return value is 900, than it seems that there where 900 likes in only 3 days. But the truth is, the 900 likes were made in 13 days. The query from e.g. 10th to 15th september also returns 900. I think this is no problem for you, but you should know about it. – user1027167 Sep 18 '15 at 08:24
2

This was what I could came up with right now. I can't properly test this query, because now I don't have the time to create these table structures in one of the web's sql test pages. But I think even if it doesn't work initially it can point you in the right direction.

select 
    id, 
    new_date,
    max(increase)
from (
select 
    dg.id, 
    dg.date new_date, 
    dg.sum - (select sum from dg where dg.date = date_format((date_sub(str_to_date(new_date, '%Y-%m-%d') 1 DAY), '%Y-%m-%d') increase
from (
select 
    language_pages_likes_id id,
    date_format(created_at, '%Y%-m$-%d') date,
    sum(likes) likes_sum
from
    language_page_likes_history lplh
group by
    language_page_likes_id,
    date_format(created_at, '%Y%-m$-%d')
) day_grouping dg
) calculate_increases

Hope it helps. Later, when I can I will further test and improve this query.

Nelson Teixeira
  • 6,297
  • 5
  • 36
  • 73