1

I have two tables

  • "visit" which basically stores each visit on a website
    | visitdate           | city     |
    ----------------------------------
    | 2014-12-01 00:00:02 | Paris    |
    | 2015-01-03 00:00:02 | Marseille|
  • "cityweather" that stores weather infos 3 times a day for a lot of cities
    | weatherdate           | city     | temp |
    -------------------------------------------
    | 2014-12-01 09:00:02   | Paris    | 20   |
    | 2014-12-01 09:00:02   | Marseille| 22   |

I precise that there can be cities in the table visit that are not in cityweather and vice versa and I need to only take cities that are common to both tables.

So my question is :

How can I SELECT for each visitdate the MAX(weatherdate) that is inferior to the visitdate ?

It should look like this :

    | visitdate           | city     | beforedate          |
    --------------------------------------------------------
    | 2014-12-01 00:00:02 | Paris    | 2014-11-30 21:00:00 |
    | 2015-01-03 15:07:26 | Marseille| 2015-01-03 09:00:00 |

I tried something like this :

SELECT t.city, t.visitdate, d.weatherdate as beforedate
    FROM visitsub as t
    JOIN cityweatherfrsub as d
    ON  d.weatherdate = 
        ( SELECT MAX(d.weatherdate)
            FROM cityweatherfrsub
            WHERE d.weatherdate <= t.visitdate AND d.city=t.city
        )
    AND d.city = t.city;

But the size of the tables make it impossible to compute it in a "reasonnable" time (10^14 steps):

    | id | select_type        | table       | type  | possible_keys         | key          | key_len | ref          | rows    | Extra                     |
    ---------------------------------------------------------------------------------------------------------------------------------------------------------
    | 1  | PRIMARY            | d           | ALL   | idx_city,Idx_citydate | NULL         | NULL    | NULL         | 1204305 | Using where               |
    | 1  | PRIMARY            | t           | ref   | Idxcity, Idxcitydate  | Idxcitydate  | 303     | meteo.d.city | 111     | Using where; Using index  |
    | 2  | DEPENDANT SUBQUERY | cityweather | index | NULL                  | Idx_date     | 6       | NULL         | 1204305 | Using where; Using index  |

I am now investigating the field of user-variable like @variable but I am very new to it and only wrote something that is not working Error Code: 1111. Invalid use of group function:

SET @j :=0;
SET @k :=0;
SET @l :=0;
SET @m :=0;
CREATE TABLE intermedweather
    SELECT @l as city, @k as visitdate, @j as beforedate
    FROM visitsub t
    JOIN cityweatherfrsub d
    WHERE (@j := d.weatherdate) <= (@k := t.visitdate) 
      AND (@l := d.city) = (@m := t.city) 
      AND  @j = MAX(d.weatherdate);

You can find here a similar post but it can't work for my problem

Community
  • 1
  • 1
Yohan Obadia
  • 2,552
  • 2
  • 24
  • 31

5 Answers5

0

Maybe something like this:

select
    V.*,
    (
        select
            MAX(weatherdate) 
        from Weather W
        where
            W.weatherdate < V.visitdate and
            W.city = V.city
    ) beforedate
from Visit V
where
    exists ( select 1 from Weather where V.city = W.city)
mxix
  • 3,539
  • 1
  • 16
  • 23
0

Try this:

 SELECT t.visitdate, t.city, max(d.weatherdate) beforedate
  FROM visit t inner JOIN cityweather d
  on t.city=d.city
  group by t.city,t.visitdate
Ormoz
  • 2,975
  • 10
  • 35
  • 50
  • Sorry but I already tried but the size of my tables (1M rows) and (6M rows) make it impossible to compute... – Yohan Obadia Aug 18 '15 at 09:33
  • @YohanObadia do you have an index on city in the two tables? – Ormoz Aug 18 '15 at 09:36
  • As I said to @mxix, I have 3 indexes on both tables : (city) (date) and (city, date). The problem is an index can't be applyed on a grouping function, therefore, after applying the Max it's like there is no index at all. – Yohan Obadia Aug 18 '15 at 09:38
  • @YohanObadia I changed the query. could you please run `explain` on the query and post the result? – Ormoz Aug 18 '15 at 09:45
  • This one is the same as the one from Marcin without the `d.weatherdate <= t.visitdate` condition. – Yohan Obadia Aug 18 '15 at 12:07
  • I added the EXPLAIN to his answer if you want to take a look ;) – Yohan Obadia Aug 18 '15 at 12:21
  • @YohanObadia Mine is not the same as his, I guess the result is different. – Ormoz Aug 18 '15 at 16:09
  • I don't get how in your query you specify that the max value for each row has to be below the visitdate, anyway, I'll give you the explain – Yohan Obadia Aug 19 '15 at 07:21
0

I'm not sure if that's what you need, but it should do the trick.

SELECT t.visitdate, d.city, MAX(d.weatherdate) as beforedate
   FROM cityweather d
   JOIN visit t
   ON d.weatherdate <= t.visitdate
   AND d.city=t.city
   GROUP BY t.visitdate, d.city;
crazy_barber
  • 46
  • 1
  • 3
  • I am trying it right now and keep you updated. If I only needed a group by ('^^)... Thanks – Yohan Obadia Aug 18 '15 at 09:42
  • can you please accept my update for the `EXPLAIN` ? Your query runned for 3 hours and I stopped it to print what I added to your answer. – Yohan Obadia Aug 18 '15 at 12:26
  • Do you have `INDEX(city, weatherdate)` on `cityweather`? – Rick James Aug 18 '15 at 21:57
  • Please `EXPLAIN` Marcin's query. – Rick James Aug 18 '15 at 21:57
  • I wanted to but he refused to have his post edited with the `EXPLAIN` It basicaly says for table `d` that it doesn't use index and need to read 54M rows and the Extra is `Using where; Using temporary; Using filesort` For table `t` it says it uses the index Idxcity with 3 rows and the Extra is `Using where` – Yohan Obadia Aug 19 '15 at 07:36
0

Alternative method, avoiding the MAX()

SELECT v.visitdate, v.city, w.weatherdate AS beforedate
FROM visit v
JOIN cityweather w
        ON v.city = w.city
        AND v.visitdate >= w.weatherdate
        AND NOT EXISTS ( SELECT * FROM cityweather nx
                WHERE nx.city = v.city
                AND nx.weatherdate <= v.visitdate
                AND nx.weatherdate > w.weatherdate
        );
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • Interesting, maybe it can help me make use of my indexes the way I want, I tried to avoid subqueries for performance purpose but I am gonna try it when the other suggestion finished to run. Thanks – Yohan Obadia Aug 18 '15 at 10:42
  • 1
    On normal DBMSses, avoiding the max() often yields better performance, given relevant indexes. (the max() often needs all the rows in the given range) – wildplasser Aug 18 '15 at 11:10
0

I ended up finding the answer myself. It all comes down to narrowing the selection on the table cityweather. So I did it in two steps to avoid the O(n^2) problem we had up until now and I reduce the size of the first table (sometimes virtual table) found in other answers:

First step (the crucial one):

CREATE TABLE intermedtable 
   SELECT t.city, t.visitdate, d.weatherdate
      FROM visit as t 
      JOIN cityweather as d
      WHERE d.city=t.city AND d.weatherdate <= t.visitdate AND d.weatherdate +  interval 1 day >= t.visitdate;

What is crucial here compared to what we had before is the d.weatherdate + interval 1 day >= t.visitdate conditions. It "only" took 22 minutes.

The second step is to find the MAX(weatherdate) for each pair (city, visitdate) :

Create table beforedatetable
   SELECT city, visitdate, max(weatherdate) as beforedate 
       FROM intermedtable
       GROUP BY city, visitdate;

With this solution I go from the 16 hours computation (with crash at the end) down to 32 minutes.

The core of this answer was to reduce the size of the virtual table created in previous answer by adding the d.weatherdate + interval 1 day >= t.visitdate condition. This is based on the fact that the weatherdate of interest can't be more than one day away from the visitdate.

Yohan Obadia
  • 2,552
  • 2
  • 24
  • 31