1

I have two tables, apps and reviews (simplified for the sake of discussion):

apps table

id int

reviews table

id          int
review_date date
app_id      int (foreign key that points to apps)

2 questions:

1. How can I write a query / function to answer the following question?:

Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?

I think I know how to write a query if given an explicit date:

SELECT
  apps.id,
  count(reviews.*)
FROM
  reviews
  INNER JOIN apps ON apps.id = reviews.app_id
group by
  1
having
  min(reviews.review_date) >= '2020-01-01'
  order by 2 desc
limit 10;

But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.

2. What's the best way to model this data?

It would be nice to have the # of reviews at the time for each date as well as the app_id. As of now I'm thinking something that might look like:

... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...

But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Eugene Kim
  • 496
  • 4
  • 17

3 Answers3

2

I think this is what you are looking for:

Postgres 13 or newer

WITH cte AS (  -- MATERIALIZED
   SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
   FROM   reviews
   GROUP  BY 1
   )
SELECT *
FROM  (
   SELECT generate_series(min(review_date)
                        , max(review_date)
                        , '1 day')::date
   FROM   reviews
   ) d(review_window_start)
LEFT  JOIN LATERAL (
   SELECT total_ct, array_agg(app_id) AS apps
   FROM  (
      SELECT app_id, total_ct
      FROM   cte c
      WHERE  c.earliest_review >= d.review_window_start
      ORDER  BY total_ct DESC
      FETCH  FIRST 1 ROWS WITH TIES  -- new & hot
      ) sub
   GROUP  BY 1
   ) a ON true;

WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:

Postgres 12 or older

WITH cte AS (  -- MATERIALIZED
   SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
   FROM   reviews
   GROUP  BY 1
   )
SELECT *
FROM  (
   SELECT generate_series(min(review_date)
                        , max(review_date)
                        , '1 day')::date
   FROM   reviews
   ) d(review_window_start)
LEFT  JOIN LATERAL (
   SELECT total_ct, array_agg(app_id) AS apps
   FROM  (
      SELECT total_ct, app_id
          ,  rank() OVER (ORDER BY total_ct DESC) AS rnk
      FROM   cte c
      WHERE  c.earliest_review >= d.review_window_start
      ) sub
   WHERE  rnk = 1
   GROUP  BY 1
   ) a ON true;

db<>fiddle here

Same as above, but without WITH TIES.

We don't need to involve the table apps at all. The table reviews has all information we need.

The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:

The optimized generate_series() call produces the series of days from earliest to latest review. See:

Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thank you! Your query is much more performant than the one I hobbled together. Thank you for the response and the related resources. – Eugene Kim Aug 09 '20 at 02:44
0

If you are looking for hints, then here are a few:

  1. Are you aware of generate_series() and how to use it to compose a table of dates given a start and end date? If not, then there are plenty of examples on this site.
  2. To answer this question for any given date, you need to have only two measures for each app, and only one of these is used to compare an app against other apps. Your query in part 1 shows that you know what these two measures are.
  3. Hints 1 and 2 should be enough to get this done. The only thing I can add is for you not to worry about making the database do "too much work." That is what it is there to do. If it does not do it quickly enough, then you can think about optimizations, but before you get to that step, concentrate on getting the answer that you want.

Please comment if you need further clarification on this.

Mike Organek
  • 11,647
  • 3
  • 11
  • 26
  • Thanks for the reply, Mike Organek. I'm aware of `generate_series` but the missing piece for me was LATERAL join, which lets me "pass" in the value of each date in `generate_series` into the subquery. – Eugene Kim Aug 08 '20 at 21:29
0

The missing piece for me was lateral join. I can accomplish just about what I want using the following:


select
  review_windows.review_window_start,
  id,
  review_total,
  earliest_review
from
  (
    select
      date_trunc('day', review_windows.review_windows) :: date as review_window_start
    from
      generate_series(
        (
          SELECT
            min(reviews.review_date)
          FROM
            reviews
        ),
        (
          SELECT
            max(reviews.review_date)
          FROM
            reviews
        ),
        '1 year'
      ) review_windows
    order by
      1 desc
  ) review_windows
  left join lateral (
    SELECT
      apps.id,
      count(reviews.*) as review_total,
      min(reviews.review_date) as earliest_review
    FROM
      reviews
      INNER JOIN apps ON apps.id = reviews.app_id
    where
      reviews.review_date >= review_windows.review_window_start
    group by
      1
    having
      min(reviews.review_date) >= review_windows.review_window_start
    order by
      2 desc,
      3 desc
    limit
      2
  ) apps_most_reviews on true;
Eugene Kim
  • 496
  • 4
  • 17