
I have a fairly common (or so I think) database structure: there are news items (News(id, source_id)), and each news item has a source (Source(id, url)). Sources are aggregated into topics (Topic(id, title)) via TopicSource(source_id, topic_id). In addition, there are users (User(id, name)) who can mark news as read via NewsRead(news_id, user_id). Here is a diagram to clear things up: db diagram
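
For reference, here is a minimal schema sketch of the tables above; the column types and constraints are my assumptions, and only the columns mentioned in the question are included:

    -- Minimal schema sketch; types/constraints are assumptions,
    -- only the columns relevant to the question are shown.
    CREATE TABLE source (id bigserial PRIMARY KEY, url text);
    CREATE TABLE topic  (id bigserial PRIMARY KEY, title text);
    CREATE TABLE "user" (id bigserial PRIMARY KEY, name text);
    CREATE TABLE news   (id bigserial PRIMARY KEY,
                         source_id bigint REFERENCES source (id));
    CREATE TABLE topic_source (source_id bigint REFERENCES source (id),
                               topic_id  bigint REFERENCES topic (id),
                               PRIMARY KEY (source_id, topic_id));
    CREATE TABLE news_read (news_id bigint REFERENCES news (id),
                            user_id bigint REFERENCES "user" (id),
                            PRIMARY KEY (news_id, user_id));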

I want to count unread news per topic for a specific user. The problem is that the News table is a big one (10^6 - 10^7 rows). Fortunately, I don't need the exact count; it's OK to stop counting after a threshold and return that threshold as the counted value.

Following this answer, for a single topic I came up with the following query:

SELECT t.topic_id, count(1) as unread_count
FROM (
 SELECT 1, topic_id
 FROM news n
   JOIN topic_source t ON n.source_id = t.source_id
   -- join news_read to filter already read news
   LEFT JOIN news_read r
     ON (n.id = r.news_id AND r.user_id = 1)
 WHERE t.topic_id = 3 AND r.user_id IS NULL
 LIMIT 10 -- Threshold
) t GROUP BY t.topic_id;

(query plan 1). This query takes about 50 ms on the test DB, which is acceptable.

Now I want to select the unread count for multiple topics. I tried to select it like this:

SELECT
  t.topic_id,
  (SELECT count(1)
   FROM (SELECT 1 FROM news n
          JOIN topic_source tt ON n.source_id = tt.source_id
          LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
          WHERE tt.topic_id = t.topic_id AND r.user_id IS NULL
          LIMIT 10 -- Threshold
        ) t) AS unread_count
FROM topic_source t WHERE t.topic_id IN (1, 2) GROUP BY t.topic_id;

(query plan 2). But for a reason unknown to me it takes about 1.5 s on the test data, while the sum of the individual queries should come to about 0.2-0.3 s.

I'm clearly missing something here. Is there a mistake in the second query? Is there a better (faster) way to select the count of unread news?

Additional info:

Table sizes:

News: 10^6 - 10^7
User: 10^3
Source: 10^4
Topic: 10^3
TopicSource: 10^5
NewsRead: 10^6

UPD: the query plans clearly show I messed up the second query. Any clues are appreciated.

UPD2: I tried this query with a lateral join, which is supposed to simply run the first (the fastest) query for each topic_id:

SELECT
  id,
  count(*)
FROM topic t
  LEFT JOIN LATERAL (
     SELECT ts.topic_id
     FROM news n
       LEFT JOIN news_read r
         ON (n.id = r.news_id AND r.user_id = 1)
       JOIN topic_source ts ON n.source_id = ts.source_id
     WHERE ts.topic_id = t.id AND r.user_id IS NULL
     LIMIT 10
) p ON TRUE
WHERE t.id IN (4, 10, 12, 16)
GROUP BY t.id;

(query plan 3). But it seems the Pg planner has a different opinion on this: it runs very slow seq scans and hash joins instead of index scans and merge joins.

  • I wonder how [this](https://paste.ofcode.org/TMhZbxCGqiSgc3ijhzZwfX) query would fare on your data. I tried creating similar volumes of sample data as yours, but since the distributions are so different I get very different results even for your original queries; for example the multi topic query only takes ~19ms (best out of many). – Ilja Everilä Dec 15 '17 at 15:30
  • @IljaEverilä, thank you for the suggestion! This query takes ~3.5s on my data. I guess our distributions are way off. [Explain is here](https://explain.depesz.com/s/q740). Surprisingly, it appears that multiple `UNION ALL` queries are very fast. I will update my post after a short research. – 9dogs Dec 15 '17 at 16:29
  • I noticed that I forgot the LIMIT 10 from the innermost sub query. That's what you get for copying just some of the many attempts. Perhaps it'd run a bit faster with that in place. – Ilja Everilä Dec 15 '17 at 16:40
  • Ah, I missed that as well. Adding `LIMIT` does help, but some strange behaviour appears: `LIMIT 10` works very fast - 25ms max, as does `LIMIT 20`. `LIMIT 100` - ~1.2s cold and ~0.7s on subsequent runs. `LIMIT 1` runs for more than 2 seconds - I don't even think I understand anything now. – 9dogs Dec 15 '17 at 16:51

1 Answer


After some benchmarking I finally settled on a simple UNION ALL query, and it is ten times faster than the lateral join on my data:

SELECT
  p.topic_id,
  count(*)
FROM (
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 4 AND r.user_id IS NULL
              LIMIT 100
            ) t1
       UNION ALL
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 10 AND r.user_id IS NULL
              LIMIT 100
            ) t1
       UNION ALL
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 12 AND r.user_id IS NULL
              LIMIT 100
            ) t1
       UNION ALL
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 16 AND r.user_id IS NULL
              LIMIT 100
            ) t1
     ) p
GROUP BY p.topic_id;

(execution plan)

The intuition here is that by specifying the `topic_id`s explicitly, one provides the Pg planner with enough information to build an effective plan.

From the SQLAlchemy point of view it's pretty straightforward:

from sqlalchemy import and_, func

# topic_ids, user_id are defined elsewhere, e.g.
# topic_ids = [4, 10, 12, 16]
# user_id = 1
topic_queries = []
for topic_id in topic_ids:
    topic_query = (
        db.session.query(News.id, TopicSource.topic_id)
        .join(TopicSource, TopicSource.source_id == News.source_id)
        # LEFT JOIN the NewsRead table to keep only unreads
        # (rows where NewsRead.user_id IS NULL)
        .outerjoin(NewsRead,
                   and_(NewsRead.news_id == News.id,
                        NewsRead.user_id == user_id))
        .filter(TopicSource.topic_id == topic_id,
                NewsRead.user_id.is_(None))
        .limit(100))
    topic_queries.append(topic_query)
# Unite the queries with UNION ALL
union_query = topic_queries[0].union_all(*topic_queries[1:])
# Group by `topic_id` and count unreads
counts = (union_query
          # Using `with_entities(func.count())` to avoid
          # a subquery.  See link below for info:
          # https://gist.github.com/hest/8798884
          .with_entities(TopicSource.topic_id.label('topic_id'),
                         func.count().label('unread_count'))
          .group_by(TopicSource.topic_id))
result = counts.all()
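
For completeness, `result` is a list of `(topic_id, unread_count)` tuples, and topics with no unread news produce no row at all (the GROUP BY drops them). A small sketch for filling those in with zeros, assuming the same `topic_ids` list as above:

    # Build a dict of unread counts per topic, defaulting to 0 for topics
    # that produced no rows; counts are capped at the threshold (100).
    unread_by_topic = {topic_id: 0 for topic_id in topic_ids}
    unread_by_topic.update(dict(result))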