5

I have 2 tables that can be simplified to this structure:

Table 1:

+----+----------+---------------------+-------+
| id | descr_id |        date         | value |
+----+----------+---------------------+-------+
| 1  |        1 | 2013-09-20 16:39:06 |     1 |
+----+----------+---------------------+-------+
| 2  |        2 | 2013-09-20 16:44:06 |     1 |
+----+----------+---------------------+-------+
| 3  |        3 | 2013-09-20 16:49:06 |     5 |
+----+----------+---------------------+-------+
| 4  |        4 | 2013-09-20 16:44:06 |   894 |
+----+----------+---------------------+-------+

Table 2:

+----------+-------------+
| descr_id | description |
+----------+-------------+
|       1  | abc         |
+----------+-------------+
|       2  | abc         |
+----------+-------------+
|       3  | abc         |
+----------+-------------+
|       4  | DEF         |
+----------+-------------+

I want to join the description into table1, filter by description so I only get rows where description=abc, and filter out "duplicate" rows, where two rows are duplicates if they have the same value and their dates are within 6 minutes of one another. My desired output table is below, (assuming abc is the desired description filter).

+----+----------+---------------------+-------+-------------+
| id | descr_id |        date         | value | description |
+----+----------+---------------------+-------+-------------+
| 1  |        1 | 2013-09-20 16:39:06 |     1 | abc         |
+----+----------+---------------------+-------+-------------+
| 3  |        3 | 2013-09-20 16:49:06 |     5 | abc         |
+----+----------+---------------------+-------+-------------+

The query I came up with is:

select * 
  from (
        select * 
          from table1 
          join table2 using(descr_id) 
         where label='abc'
       ) t1 
  left join (
        select * 
          from table1 
          join table2 using(descr_id) 
         where label='abc'
        ) t2 on( t1.date<t2.date and t1.date + interval 6 minute > t2.date) 
 where t1.value=t2.value.

Unfortunately this query takes over a minute to run with my dataset, and returns no results (although I believe there should be results). Is there a more efficient way to perform this query? Is there a way to name a derived table and reference it later in the same query? Also, why does my query return no results?

Thanks in advance for help!

edit: I would like to retain the first of several samples with close-together timestamps.

My table1 has 6.1 million rows, my table2 has 30K, which makes me realize that table2 would only have one row for description "abc". This means that I can just query for descr_id beforehand, then use that id to avoid joining table2 in the big query at all, making it much more efficient. However, if my table2 was setup as outlined above (which would be poor database design, I admit) what is a good way to perform such a query?

Eric Fitting
  • 160
  • 1
  • 7
  • 1
    Are you hoping to retain the first of several samples with close-together timestamps, or the last of them, or averaging their timestamps, or what? What timestamp should be in the resultset to represent each bunch of your samples are close together? – O. Jones Sep 25 '13 at 17:28
  • Nice question BTW +1 how many records does the tables have? – Raymond Nijland Sep 25 '13 at 18:41

2 Answers2

1

Try creating temporary tables and joining on the temporary tables:

CREATE TEMPORARY TABLE t1 AS (select * 
          FROM table1 
          JOIN table2 USING(descr_id) 
         WHERE label='abc')

CREATE TEMPORARY TABLE t2 AS (select * 
          FROM table1 
          JOIN table2 USING(descr_id) 
         WHERE label='abc')

SELECT *
FROM t1
LEFT JOIN t2 on( t1.date<t2.date and t1.date + interval 6 minute > t2.date) 
WHERE t1.value=t2.value

Temporary tables are automatically cleaned up after you disconnect from your database so there is no need to explicitly drop them.

I originally had this, but I don't believe it achieves the full requirements:

SELECT t1.id,
       t1.descr_id,
       t1.date,
       t1.value,
       t2.description
FROM table1 t1
JOIN table2 t2 ON t1.descr_id = t2.descr_id
WHERE t2.description = 'abc' 

This is essentially the same as the original query, however another option might be to create a view and join on the view like this:

CREATE VIEW v1 AS
SELECT * FROM table1 JOIN table2 USING(descr_id) WHERE label='abc'

CREATE VIEW v2 AS
SELECT * FROM table1 JOIN table2 USING(descr_id) WHERE label='abc'

SELECT *
FROM v1
LEFT JOIN v2 on( v1.date<v2.date and v1.date + interval 6 minute > v2.date) 
WHERE v1.value=v2.value

Also, if you run this query on a regular basis, you might consider loading the results from your first query into a staging table and doing your join on the staging table like this:

INSERT INTO staging
(SELECT * 
        FROM table1 
        JOIN table2 USING(descr_id) 
        WHERE label='abc')

SELECT *
    FROM staging s1
    LEFT JOIN staging s2 on( s1.date<s2.date and s1.date + interval 6 minute > s2.date) 
    WHERE s1.value=s2.value

TRUNCATE TABLE staging
Cameron Tinker
  • 9,634
  • 10
  • 46
  • 85
  • Please dont use this SQL anti pattern... very bad way to do this... because this can cause an myisam disk based table... – Raymond Nijland Sep 25 '13 at 17:46
  • Would dropping the tables after use be better practice? Or should temporary tables not be used at all? – Cameron Tinker Sep 25 '13 at 17:50
  • 1
    temporary table should be avoid it could result in the myisam disk see this http://dev.mysql.com/doc/refman/5.7/en/internal-temporary-tables.html it max_heap_table_size is to small it will happen – Raymond Nijland Sep 25 '13 at 18:40
0

try to use not exists something like select * from table1 t1 join table2 t2 using(descr_id) where label='abc' and not exists (select * from table1 t11 join table2 t22 using(descr_id) where label='abc' and t1.date < t11.date and t1.date + interval 6 minute > t11.date)

you may need to double check the (t1.date + interval 6 minute) syntax

akallali
  • 29
  • 5