
My problem is that I normally do not work with data this large. Methods that have worked for me on smaller data sets choke on this one, perhaps because my RAM is not enough for it. I need to do a specific analysis on a large dataset in MySQL; the tables below contain around 10 GB of data.

I have two tables. table1 is as follows; id is the primary key and table2_id is a foreign key into table2:

id date                 aid table2_id message
1  2014-03-19 16:21:02  121   1       login from xxxx
2  2014-03-20 14:00:32  123   2       decrease budget
3  2014-03-20 18:00:32  121   2       increase budget
4  2014-03-21 16:21:02  121   3       login from xxxx
5  2014-03-21 16:21:02  121   3       login from xxxx
6  2014-03-22 10:21:02  121   2       increase budget

table2 is a type lookup table:

id type
1  login
2  change_budget
3  search

I need to analyse the relationship between the change_budget activity and the other activities. This is what I have tried so far:

I tried to write SQL so that, each time an aid has a change_budget, I count all the other actions of that aid in the preceding 24 hours. For example, if an aid has a change_budget at 2014-03-14 16:21:02, I need all the actions performed by that aid from 2014-03-13 16:21:02 to 2014-03-14 16:21:02, grouped by action type and counted. The result I want should look as follows:

aid login search change_budget
121 1     0      1
123 0     0      -1
121 0     2      1 

-1 means the change_budget was a decrease, 1 means an increase. The other numbers are how many times each action was performed by that aid within the day before that change_budget happened.
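Roughly, the query I have been attempting has the shape sketched below (this is only an illustration of the idea, using the table and column names above; I am not sure this self-join approach is even workable at this size):

-- Sketch only: one output row per change_budget event, counting that aid's
-- other actions in the 24 hours before it.
SELECT
    cb.aid,
    COALESCE(SUM(pt.type = 'login'), 0)  AS login,
    COALESCE(SUM(pt.type = 'search'), 0) AS search,
    IF(cb.message LIKE '%decrease%', -1, 1) AS change_budget
FROM table1 cb
JOIN table2 ct        ON ct.id = cb.table2_id AND ct.type = 'change_budget'
LEFT JOIN table1 prev ON prev.aid = cb.aid
                     AND prev.date >= cb.date - INTERVAL 1 DAY
                     AND prev.date <  cb.date
LEFT JOIN table2 pt   ON pt.id = prev.table2_id
GROUP BY cb.id;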

My problem is that I normally do not have data this large. If I treat it the same way as small data, my RAM is not enough. So I need a big-data solution. Any suggestions?

Note: I have edited this question in an attempt to explain the issue better, but it will still need review from the original author to verify that it states the problem correctly.

holdenweb
user3795071
  • Just write the code for your idea... – user2963623 Jul 10 '14 at 20:40
  • @user2963623 obviously if the OP knew how to write the code he wouldn't be asking for help... – John Ruddell Jul 10 '14 at 20:41
  • @user2963623 My problem is I normally do not have that large data. If I treat it in the same way as small data, my ram is not enough for it. So I need a big data solution. Any suggestion? – user3795071 Jul 10 '14 at 20:41
  • @user3795071 one thing you can do is make a small version of your table either on your localhost machine or even a sqlfiddle and then work with a smaller dataset... – John Ruddell Jul 10 '14 at 20:43
  • Well big data requires time to process. Whenever I deal with such things, I keep my computer on for hours if not days! You might want to look at multithreading if that helps – user2963623 Jul 10 '14 at 20:44
  • The meaning of your question isn't entirely clear. Perhaps if you could give us a more complete example with a little more data it might be clearer? For example, what if there are some positive and some negative changes within the 24-hour period? – holdenweb Jul 10 '14 at 20:48
  • @holdenweb record both of them. for the new table, each change_budget for each aid is a row. – user3795071 Jul 10 '14 at 20:57
  • So do you want a number of outputs for each budget change row, or for each 24-hour period? – holdenweb Jul 10 '14 at 21:01
  • @holdenweb for each time, a change_budget happen for one aid, count the action of this aid within (date-1day,date),and count each action. – user3795071 Jul 10 '14 at 21:05
  • @user3795071 your data is incorrect. looking at what you posted and your expected results does not make sense and does not match up. 121 does not have a decrease and 123 does not have an increase? – John Ruddell Jul 10 '14 at 21:16

3 Answers


OK, so there are a few things about this question that just don't make sense... however, with what you gave to work with, this is what I worked up.

FIRST: your desired outcome does not match your description. You say you want it by the day it happens, but the desired outcome you posted is completely different. I went out on a limb and assumed you want it by date, since that is what was requested, so with that in mind this is what you would have:

+-------------+---------+-----------+-----------+-----------+
|day occurred | aid     | logins    | searches  | budget    |
+-------------+---------+-----------+-----------+-----------+
|'2014-03-19' | 121     |   1       |   0       |   0       |
|'2014-03-20' | 123     |   0       |   0       |   -1      |
|'2014-03-20' | 121     |   0       |   0       |   1       |
|'2014-03-21' | 121     |   0       |   2       |   0       |
|'2014-03-22' | 121     |   0       |   0       |   1       |
+-------------+---------+-----------+-----------+-----------+

This is from the data you provided; it is what happened on each day for each aid.

SECOND: with this in mind, you can achieve this in a few ways. The cleanest is to build temporary tables for the setup and then run a simple query off of them.

SETUP:

CREATE TABLE logs
    (`id` int, `date` datetime, `aid` int, `t2_id` int, `message` varchar(55))
;

INSERT INTO logs
    (`id`, `date`, `aid`, `t2_id`, `message`)
VALUES
(1,  '2014-03-19 16:21:02',  121,   1,       'login from xxxx'),
(2,  '2014-03-20 14:00:32',  123,   2,       'decrease budget'),
(3,  '2014-03-20 18:00:32',  121,   2,       'increase budget'),
(4,  '2014-03-21 16:21:02',  121,   3,       'login from xxxx'),
(5,  '2014-03-21 16:21:02',  121,   3,       'login from xxxx'),
(6,  '2014-03-22 10:21:02',  121,   2,       'increase budget')
;


CREATE TABLE log_type
    (`id` int, `type` varchar(55))
;

INSERT INTO log_type
    (`id`, `type`)
VALUES
(1,  'login'),
(2,  'change_budget'),
(3,  'search');

I called the tables logs and log_type, as that is what seems to be happening here: the tables are logging interactions.

TEMPORARY TABLES:

CREATE TEMPORARY TABLE t1
(
    SELECT
        l.aid,
        DATE(date) AS grouping_col,
        IF(
            lt.type = 'change_budget', 
                IF(l.message LIKE '%decrease%', -1, 1), 
                0
        ) AS changed_budget
    FROM logs l 
    JOIN log_type lt ON lt.id = l.t2_id
    GROUP BY grouping_col, changed_budget, aid
);

CREATE TEMPORARY TABLE t2
(   SELECT
        DATE(l.date) AS grouping_col,
        l.aid,
        IF(lt.type = 'login', COUNT(l.id), 0) AS logged_in
    FROM logs l 
    JOIN log_type lt ON lt.id = l.t2_id
    GROUP BY grouping_col,  aid
); 

CREATE TEMPORARY TABLE t3
 (
    SELECT
        DATE(l.date) AS grouping_col,
        l.aid,
        IF(lt.type = 'search', COUNT(l.id), 0) AS searched_for
    FROM logs l 
    JOIN log_type lt ON lt.id = l.t2_id
    GROUP BY grouping_col, aid
);
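A caveat: in t2 and t3 above, lt.type is neither aggregated nor part of the GROUP BY, so the IF(..., COUNT(...), 0) pattern relies on MySQL's permissive GROUP BY handling and will error if ONLY_FULL_GROUP_BY is enabled. If you run into that, a sketch of t2 written with conditional aggregation instead (t3 is analogous, with 'search') would be:

-- Alternative sketch for t2: SUM over a boolean counts only the login rows
-- within each (day, aid) group.
CREATE TEMPORARY TABLE t2
(   SELECT
        DATE(l.date) AS grouping_col,
        l.aid,
        SUM(lt.type = 'login') AS logged_in
    FROM logs l
    JOIN log_type lt ON lt.id = l.t2_id
    GROUP BY grouping_col, aid
);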

FINAL QUERY:

Finally, all you need to do is query off of these temporary tables to get your desired outcome:

SELECT 
    t1.grouping_col as day_occurred, 
    t1.aid, 
    logged_in, 
    searched_for, 
    changed_budget 
FROM t1
JOIN t2 on t2.grouping_col = t1.grouping_col AND t2.aid = t1.aid
JOIN t3 on t3.grouping_col = t1.grouping_col AND t3.aid = t1.aid
;

RESULTS:

I set this up on my localhost database and ran it, and the outcome matches the table shown under FIRST above. (Screenshot of the results omitted.)

John Ruddell

Look at dplyr. It's intended explicitly to allow you to work with database tables as if they were dataframes (subset, logical ops, etc.) and allows you to build out statements one after another in a SQL-esque way.

Patrick McCarthy
  • My problem is I normally do not have that large data. If I treat it in the same way as small data, my ram is not enough for it. So I need a big data solution. Any suggestion? – user3795071 Jul 10 '14 at 20:43
  • Yes, you used those exact words in your question. Dplyr does a lot of the grouping and summation on the sql side by writing clever queries. I suspect it will work for you. Also, consider translating verbose text strings ("decrease budget") to single-byte numeric codes ("2"). – Patrick McCarthy Jul 10 '14 at 21:05

The way I understand the problem is the following:

  1. Take the date/time field of the dataset, extract the date
  2. Group the data by the extracted date and aid
  3. Analyze the text column to look for the word "increase/decrease" in case table2_id = 2
  4. "pivot" the result, so the grouped data are in columns and not in rows

The following code should solve part of the problem, up to step 4 (it might contain syntax errors, because I do not have MySQL around, and date/time functions differ between DBMSes):

select date("date"), aid, table2_id, count(message)
from table1
group by date("date"), aid, table2_id

Pivoting the result in MySQL seems to be answered in the following post: MySQL pivot table
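For illustration, that pivot is typically done in MySQL with conditional aggregation; a sketch along those lines (again untested, and assuming the table and column names from the question) could look like:

-- sketch: pivot the per-day, per-aid counts into one column per activity type;
-- note that budget increases and decreases on the same day cancel out here
select date(a.`date`) as day_occurred,
       a.aid,
       sum(b.type = 'login')  as login,
       sum(b.type = 'search') as search,
       sum(case
             when b.type = 'change_budget' and a.message like '%decrease%' then -1
             when b.type = 'change_budget'                                 then  1
             else 0
           end) as change_budget
from table1 a
join table2 b on b.id = a.table2_id
group by day_occurred, a.aid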

Comments:

  • 10GB is not really large. To compute this query, the DBMS would most probably need to group by, which basically means performing a sort over the first three columns and then doing one more scan over the data to do the actual grouping. Realistically assuming (in this case) that a record occupies 50 bytes, this means a table with 200M records => sorting costs are 19 scans over 10GB (a rather pessimistic estimate), which means 190GB of data to scan and write. Assuming we can read/write 50MB/s, the whole query should take 3800 seconds (a very pessimistic estimate). Realistically, I would assume something like 10GB ÷ 50MB/s = 200 seconds of execution time

  • The query seems to be a rather simple data selection/aggregation query. I'd recommend reading a book on SQL to learn more; you might only need the first few dozen pages

  • https://dba.stackexchange.com/ seems to be a more appropriate place for this kind of question. But given the simplicity of the question and the answer, not many people there may find it interesting enough to answer :)

My problem is I normally do not have that large data. If I treat it in the same way as small data, my ram is not enough for it. So I need a big data solution. Any suggestion?

DBMSes are smart systems. They check how much RAM you have and look for fast ways to execute your queries. In your case, if the amount of RAM M is much smaller than 10GB, the DBMS will chop your data into blocks that fit into RAM, sort each block in memory, write it to disk, and then merge the sorted runs, which is far faster than trying to sort everything in RAM at once.
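If you want to see or nudge the limits MySQL uses here, the variables below are the relevant knobs for in-memory sorting and implicit temporary tables (the SET value is only an example, not a recommendation):

-- inspect the settings that decide when sorts and implicit temporary tables
-- spill from RAM to disk
show variables like 'sort_buffer_size';
show variables like 'tmp_table_size';
show variables like 'tmpdir';

-- example only: raise the per-session sort buffer for one heavy query (64MB)
set session sort_buffer_size = 67108864;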

arthur
  • I think he knows how he could normally do it, but this solution is too memory-intensive for him – Eknoes Jul 10 '14 at 20:57
  • Thanks for your answer. The SQL I wrote made my computer die... Maybe the reason is not that the data is too large, just that I cannot write SQL. – user3795071 Jul 10 '14 at 21:30
  • you might want to create a smaller table (say some 10000 rows to begin with) to play with. Do something like: `create table smaller as select * from table1 where "data" < some_date` (or limit 10000 for instance) and then run your SQL queries over the smaller thing. That might "kill" your computer less. Once you are done, let your machine run over the large query. Keep in mind, that because of the I/Os your machine may be very slow in responding during the query – arthur Jul 10 '14 at 21:38
  • Certainly the naive approach of joining the table to itself with a correlated subquery would be likely to make performance an issue, which may explain the reason your computer "died".. – holdenweb Jul 11 '14 at 09:58