3

This is a postgresql problem.

PostgreSQL 8.3.3 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-9).

The table looks like:

date_time           other_column
2012-11-01 00:00:00 ...
2012-11-02 01:00:00 ...
2012-11-02 02:00:00 ...
2012-11-02 03:00:00 ...
2012-11-02 04:00:00 ...
2012-11-03 05:00:00 ...
2012-11-03 06:00:00 ...
2012-11-05 00:00:00 ...
2012-11-07 00:00:00 ...
2012-11-07 00:00:00 ...
...

I want to select at most 3 records per day from a specific date range.

For example, I want to select at most 3 records from 2012-11-02 to 2012-11-05. The expected result would be:

date_time           other_column
2012-11-02 01:00:00 ...
2012-11-02 02:00:00 ...
2012-11-02 03:00:00 ...
2012-11-03 05:00:00 ...
2012-11-03 06:00:00 ...
2012-11-05 00:00:00 ...

I have spent a few hours on this and still cannot figure it out. Please help me. :(

UPDATE: The current sql I tried could only select one record per day:

SELECT DISTINCT ON (TO_DATE(SUBSTRING((date_time || '') FROM 1 FOR 10), 'YYYY-MM-DD')) *
FROM myTable
WHERE  date_time >=  '20121101 00:00:00'  
AND  date_time <= '20121130 23:59:59'
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Drogba
  • 4,186
  • 4
  • 23
  • 31
  • 1
    Add your relevant java code please which you have tried.. – thar45 Nov 16 '12 at 04:21
  • 1
    this is a sql problem actually. – Drogba Nov 16 '12 at 04:22
  • Is this a greatest-n-per-group problem? See http://stackoverflow.com/questions/751399/sql-query-how-to-apply-limit-within-group-by. I added that tag, please post current query – Jayan Nov 16 '12 at 04:26
  • 1
    It appears you mean *at most 3 records **per day**.*. Correct? Also: At most 3 records **according to what criteria**? Pick an arbitrary record? Greatest? Least? What? Looks like you've used least, but that's good to specify. – Craig Ringer Nov 16 '12 at 05:33
  • 2
    `(TO_DATE(SUBSTRING((date_time || '') FROM 1 FOR 10), 'YYYY-MM-DD')` appears to be a particularly ghastly way to write `date_trunc('day', date_time)`. See http://www.postgresql.org/docs/current/static/functions-datetime.html and http://sqlfiddle.com/#!12/0fd51/13 – Craig Ringer Nov 16 '12 at 06:15
  • vote up for the date_trunc function. I haven't notice it. Thanks – Drogba Nov 16 '12 at 06:16

3 Answers3

3

The following answers all use date_trunc('day',date_time) or just cast to date to truncate a timestamp to a date. There's no need to jump through hoops with date formatting and strings. See Date/time functions in the manual.

This SQLFiddle shows three possible answers: http://sqlfiddle.com/#!12/0fd51/14, all of which produce the same result for the input data (but not necessarily the same result if date_time can have duplicates in it).

To solve your problem you could use a correlated subquery with a limit to generate an IN-list to filter on:

SELECT a.date_time, a.other_column
FROM table1 a
WHERE a.date_time IN (
  SELECT b.date_time
  FROM table1 b
  WHERE b.date_time IS NOT NULL
    AND a.date_time::date = b.date_time::date
  ORDER BY b.date_time
  LIMIT 3
)
AND a.date_time::date BETWEEN '2012-11-02' AND '2012-11-05';

This should be the most portable approach - though it won't work with MySQL (at least as of 5.5) because MySQL doesn't support LIMIT in a subquery used in an IN clause. It works in SQLite3 and PostgreSQL, though, and should work in most other DBs.

Another option would be to select the range of dates you wanted, annotate the rows within the range with a row number using a window function, then filter the output to exclude excess rows:

SELECT date_time, other_column
FROM (
  SELECT 
    date_time, 
    other_column, 
    rank() OVER (PARTITION BY date_trunc('day',date_time) ORDER BY date_time) AS n
  FROM Table1
  WHERE date_trunc('day',date_time) BETWEEN '2012-11-02' AND '2012-11-05'
  ORDER BY date_time
) numbered_rows
WHERE n < 4;

If ties are a possibility, ie if date_time is not unique, then consider using either the rank or dense_rank window functions instead of row_number to get deterministic results, or add an additional clause to the ORDER BY in row_number to break the tie.

If you use rank then it'll include none of the rows if it can't fit all of them in; if you use dense_rank it'll include all of them even if it has to go over the 3-row-per-day limit to do so.

All sorts of other processing are possible this way too, using the window specification.


Here's yet another formulation that uses array aggregation and slicing, which is completely PostgreSQL specific but fun.

SELECT b.date_time, b.other_column 
FROM (
  SELECT array_agg(a.date_time ORDER BY a.date_time)
  FROM table1 a
  WHERE a.date_time::date BETWEEN '2012-11-02' 
    AND '2012-11-05'
  GROUP BY a.date_time::date
) x(arr) 
INNER JOIN table1 b ON (b.date_time = ANY (arr[1:3]));
Community
  • 1
  • 1
Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
3

I want to select at most 3 records per day from a specific date range.

SELECT date_time, other_column
FROM  (
   SELECT *, row_number() OVER (PARTITION BY date_time::date) AS rn
   FROM   tbl
   WHERE  date_time >= '2012-11-01 0:0'
   AND    date_time <  '2012-12-01 0:0'
   ) x
WHERE  rn < 4;

Major points

  • Use the window function row_number(). rank() or dense_rank() would be wrong according to the question - more than 3 records might be selected with timestamp duplicates.

  • Since you do not define which rows you want per day, the correct answer is not to include an ORDER BY clause in the window function. Gives you an arbitrary selection, which matches the question.

  • I changed your WHERE clause from

    WHERE  date_time >= '20121101 00:00:00'  
    AND    date_time <= '20121130 23:59:59'
    

    to

    WHERE  date_time >=  '2012-11-01 0:0'  
    AND    date_time <   '2012-12-01 0:0'
    

    Your syntax would fail for corner cases like '20121130 23:59:59.123'.

    What @Craig suggested:

    date_time::date BETWEEN '2012-11-02' AND '2012-11-05'
    

    .. would work correctly, but is an anti-pattern regarding performance. If you apply a cast or a function to your database column in the expression, plain indexes cannot be used.

Solution for PostgreSQL 8.3

Best solution: Upgrade to a more recent version, preferably to the current version 9.2.

Other solutions:

For only few days you could employ UNION ALL:

SELECT date_time, other_column
FROM   tbl t1
WHERE  date_time >= '2012-11-01 0:0'
AND    date_time <  '2012-11-02 0:0'
LIMIT  3
)
UNION ALL 
(
SELECT date_time, other_column
FROM   tbl t1
WHERE  date_time >= '2012-11-02 0:0'
AND    date_time <  '2012-11-03 0:0'
LIMIT  3
)
...

Parenthesis are not optional here.

For more days there are workarounds with generate_series() - something like I posted here (including a link to more).

I might have solved it with a plpgsql function back in the old days before we had window functions:

CREATE OR REPLACE FUNCTION x.f_foo (date, date, integer
                         , OUT date_time timestamp, OUT other_column text)
  RETURNS SETOF record AS
$BODY$
DECLARE
    _last_day date;          -- remember last day
    _ct       integer := 1;  -- count
BEGIN

FOR date_time, other_column IN
   SELECT t.date_time, t.other_column
   FROM   tbl t
   WHERE  t.date_time >= $1::timestamp
   AND    t.date_time <  ($2 + 1)::timestamp
   ORDER  BY t.date_time::date
LOOP
   IF date_time::date = _last_day THEN
      _ct := _ct + 1;
   ELSE
      _ct := 1;
   END IF;

   IF _ct <= $3 THEN
      RETURN NEXT;
   END IF;

   _last_day := date_time::date;
END LOOP;

END;
$BODY$ LANGUAGE plpgsql STABLE STRICT;

COMMENT ON FUNCTION f_foo(date3, date, integer) IS 'Return n rows per day
$1 .. date_from (incl.)
$2 .. date_to  (incl.)
$3 .. maximim rows per day';

Call:

SELECT * FROM f_foo('2012-11-01', '2012-11-05', 3);
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Good point re cast vs time ranges. I covered the issues with `rank` and `dense_rank`; I was just suggesting them as tiebreakers if the rather under-specified question turned out to need more deterministic row selection at the expense of always getting exactly 3 rows. – Craig Ringer Nov 20 '12 at 12:41
  • I got an error while executing the query. ERROR: syntax error at or near "OVER" http://www.postgresql.org/docs/9.1/static/tutorial-window.html I have tried trimmed it down to just run SELECT *, row_number() OVER() AS rn FROM mytable WHERE date_time>= '2012-11-01 0:0' AND date_time< '2012-12-01 0:0' It still not working. – Drogba Nov 26 '12 at 02:19
  • @Drogba: Seems like you neglected to mention your outdated version of PostgreSQL. Window functions were introduced with **PostgreSQL 8.4**. – Erwin Brandstetter Nov 26 '12 at 02:49
  • omg, PostgreSQL 8.3.3 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-9). Any alternative? – Drogba Nov 26 '12 at 03:37
  • 1
    @Drogba: I added solutions for 8.3. Next time please remember to mention your version number right away. – Erwin Brandstetter Nov 26 '12 at 15:04
-3

I would use a sub-select and a left outer join. This should do the trick:

select distinct(date_format(a.date_time,"%Y-%m-%d")) date_time, b.* from table a
left outer join (
  select date_format(date_time,"%Y-%m-%d") dt, * from table limit 3
) b 
on date_format(a.date_time,"%Y-%m-%d") = b.dt; 
Tiit
  • 520
  • 4
  • 15
  • @BhavikAmbani It isn't clear what you're trying to say here. That said, the code above is obviously wrong, in that it uses `"double quotes"` (identifier quoting) for `'literal strings'`, per http://sqlfiddle.com/#!12/0fd51/4 – Craig Ringer Nov 16 '12 at 06:07
  • @CraigRinger Thats what I am saying – Bhavik Ambani Nov 16 '12 at 06:08
  • @Tiit Once I correct the obvious errors, this still fails; it appears to be a MySQL-only query using the MySQL `date_format` function. In PostgreSQL one would use `to_char` instead of `date_format`, but in this case you're really truncating the timestamp to a date so you should use `date_trunc` instead. Yours after quote correction: http://sqlfiddle.com/#!12/0fd51/8 . It also seems to discard the time-part completely; see here after correction of date truncation: http://sqlfiddle.com/#!12/0fd51/9 – Craig Ringer Nov 16 '12 at 06:10
  • 1
    A subquery is evaluated *before* the outer query. This query is just utter nonsense in *any* RDBMS. – Erwin Brandstetter Nov 16 '12 at 13:36