Improving Subquery performance in Postgres

Question

I have these two tables in my database

  Student Table                   Student Semester Table
| Column     : Type     |       | Column     : Type     |
|------------|----------|       |------------|----------|
| student_id : integer  |       | student_id : integer  |      
| satquan    : smallint |       | semester   : integer  |
| actcomp    : smallint |       | enrolled   : boolean  | 
| entryyear  : smallint |       | major      : text     |
|-----------------------|       | college    : text     |
                                |-----------------------|

Where student_id is a unique key in the student table, and a foreign key in the student semester table. The semester integer is just a 1 for the first semester, 2 for the second, and so on.

I'm doing queries where I want to get the students by their entryyear (and sometimes by their sat and/or act scores), then get all of those students associated data from the student semester table.

Currently, my queries look something like this:

SELECT * FROM student_semester
WHERE student_id IN(
    SELECT student_id FROM student_semester
    WHERE student_id IN(
        SELECT student_id FROM student WHERE entryyear = 2006
    ) AND college = 'AS' AND ...
)
ORDER BY student_id, semester;

But, this results in relatively long running queries (400ms) when I am selecting ~1k students. According to execution plan, most of the time is spent doing a hash join. To ameliorate this, I have added satquan, actpcomp, and entryyear columns to the student_semester table. This reduces the time to run the query by ~90%, but results in a lot of redundant data. Is there a better way to do this?

These are the indexes that I currently have (Along with the implicit indexes on student_id):

CREATE INDEX act_sat_entryyear ON student USING btree (entryyear, actcomp, sattotal)
CREATE INDEX student_id_major_college ON student_semester USING btree (student_id, major, college)

Query Plan

QUERY PLAN
Hash Join  (cost=17311.74..35895.38 rows=81896 width=65) (actual time=121.097..326.934 rows=25680 loops=1)
  Hash Cond: (public.student_semester.student_id = public.student_semester.student_id)
  ->  Seq Scan on student_semester  (cost=0.00..14307.20 rows=698820 width=65) (actual time=0.015..154.582 rows=698820 loops=1)
  ->  Hash  (cost=17284.89..17284.89 rows=2148 width=8) (actual time=121.062..121.062 rows=1284 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 51kB
        ->  HashAggregate  (cost=17263.41..17284.89 rows=2148 width=8) (actual time=120.708..120.871 rows=1284 loops=1)
              ->  Hash Semi Join  (cost=1026.68..17254.10 rows=3724 width=8) (actual time=4.828..119.619 rows=6184 loops=1)
                    Hash Cond: (public.student_semester.student_id = student.student_id)
                    ->  Seq Scan on student_semester  (cost=0.00..16054.25 rows=42908 width=4) (actual time=0.013..109.873 rows=42331 loops=1)
                          Filter: ((college)::text = 'AS'::text)
                    ->  Hash  (cost=988.73..988.73 rows=3036 width=4) (actual time=4.801..4.801 rows=3026 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 107kB
                          ->  Bitmap Heap Scan on student  (cost=71.78..988.73 rows=3036 width=4) (actual time=0.406..3.223 rows=3026 loops=1)
                                Recheck Cond: (entryyear = 2006)
                                ->  Bitmap Index Scan on student_act_sat_entryyear_index  (cost=0.00..71.03 rows=3036 width=0) (actual time=0.377..0.377 rows=3026 loops=1)
                                      Index Cond: (entryyear = 2006)
Total runtime: 327.708 ms

I was mistaken about there not being a Seq Scan in the query. I think the Seq Scan is being done due to the number of rows that match the college condition; when I change it to one that has less students an index is used. Source: https://stackoverflow.com/a/5203827/880928

Query with entryyear column included student semester table

SELECT * FROM student_semester
WHERE student_id IN(
    SELECT student_id FROM student_semester
    WHERE entryyear = 2006 AND collgs = 'AS'
) ORDER BY student_id, semester;

Query Plan

Sort  (cost=18597.13..18800.49 rows=81343 width=65) (actual time=72.946..74.003 rows=25680 loops=1)
  Sort Key: public.student_semester.student_id, public.student_semester.semester
  Sort Method: quicksort  Memory: 3546kB
  ->  Nested Loop  (cost=9843.87..11962.91 rows=81343 width=65) (actual time=24.617..40.751 rows=25680 loops=1)
        ->  HashAggregate  (cost=9843.87..9845.73 rows=186 width=4) (actual time=24.590..24.836 rows=1284 loops=1)
              ->  Bitmap Heap Scan on student_semester  (cost=1612.75..9834.63 rows=3696 width=4) (actual time=10.401..23.637 rows=6184 loops=1)
                    Recheck Cond: (entryyear = 2006)
                    Filter: ((collgs)::text = 'AS'::text)
                    ->  Bitmap Index Scan on entryyear_act_sat_semester_enrolled_cumdeg_index  (cost=0.00..1611.82 rows=60192 width=0) (actual time=10.259..10.259 rows=60520 loops=1)
                          Index Cond: (entryyear = 2006)
        ->  Index Scan using student_id_index on student_semester  (cost=0.00..11.13 rows=20 width=65) (actual time=0.003..0.010 rows=20 loops=1284)
              Index Cond: (student_id = public.student_semester.student_id)
Total runtime: 74.938 ms

Please post the execution plan using `explain analyze` and any index defined on the tables. More on posting this kind of questions here: https://wiki.postgresql.org/wiki/Slow_Query_Questions — , May 08 '13 at 16:48
When asking for performance optimization you also have to provide your version of Postgres. Should go without saying. Read the [tag info for postgresql-performance](http://stackoverflow.com/tags/postgresql-performance/info) — Erwin Brandstetter, May 08 '13 at 16:50
@ErwinBrandstetter I didn't post the version of Postgres because I thought that this was more of a general database schema/query strategy question, but I'll add the version as well as the query plan. — cmorse, May 08 '13 at 17:05
Do you want students who entered _in AS in 2006_ or students who entered in 2006 (in any college) who _at some time_ were in AS? And with respect to your last version, I suggest you try it with the `IN` replaced by a similar `EXISTS` (see my answer below) _and_ add an index on `student_id, entry_year`. — Andrew Lazarus, May 08 '13 at 23:10
Before adding some indexes, I would advise to add primary key constraints to the tables. For student that would obviously be `{student_id}` , and for student_semester *probably* `{student_id, semester}` , but this is not clear from the question. Also: the specificity for `entryyear` will probably be too low to afford an index scan anyway (unless you have more than about 20 years of data) — wildplasser, May 09 '13 at 12:14
Right now I have 10 years of data for ~35k students and about 600k semester rows. I do have a primary key on student_id in student, I can play around with adding one to student_semester. I should get a chance to try the other queries here later today, thanks for all the help! — cmorse, May 09 '13 at 13:34

score 3 · Answer 1 · answered May 08 '13 at 17:51

3

An alternative approach to doing the query is to use window functions.

select t.*  -- Has the extra NumMatches column.  To eliminate it, list the columns you want
from (select ss.*,
             sum(case when ss.college = 'AS' and s.entry_year = 206 then 1 else 0 end) over
                  (partition by student_id) as NumMatches
      from student_semester ss join
           student s
           on ss.student_id = s.student_id
    ) t
where NumMatches > 0;

Window functions are usually faster than joining in an aggregation, so I suspect that this might perform well.

answered May 08 '13 at 17:51

Gordon Linoff

1,242,037
58
646
786

This one actually ran substantially slower than the original query (almost 1 full second). It took about 1 second to complete. According to the query plan it was scanning every row in the table 3 separate times (even though it claimed to be using the indexes). – cmorse May 10 '13 at 22:08
@cmorse . . . Interesting. I'm glad you did the test. The difference in the queries, I think, is that this is calculating `NumMatches` over all the data, instead of a subset. The selectivity of the aggregation overcomes (what I believe to be) the slightly better performance of the window function. – Gordon Linoff May 10 '13 at 22:53
1

Thanks for posting this query. I've never done much with window functions. It was interesting seeing it done. – cmorse May 10 '13 at 22:58

Clodoaldo Neto · Answer 2 · 2013-05-08T16:47:32.363

2

The clean version of you query is

select ss.*
from
    student s
    inner join
    student_semester ss using(student_id)
where
    s.entryyear = 2006
    and exists (
        select 1
        from student_semester
        where
            college = 'AS'
            and student_id = s.student_id
    )
order by ss.student_id, semester

edited May 08 '13 at 16:47

answered May 08 '13 at 15:37

Clodoaldo Neto

118,695
26
233
260

I'd expect this to perform well if there are indexes covering student.entryyear and student_semester.college, and student_semester.semester. On the other hand, if there are only 2 values in student_semester.semester, *that* could be annoying. EXPLAIN ANALYZE would tell the whole story. – Mike Sherrill 'Cat Recall' May 08 '13 at 15:51
This isn't the same query. This only returns rows from the 'AS' college. The original query returns records for students who are ever in the 'AS' college. – Gordon Linoff May 08 '13 at 16:09
@Gordon I don't understand the _who are ever in the 'AS' college_ part of your comment. – Clodoaldo Neto May 08 '13 at 16:18
2

@ClodoaldoNeto The query is intended to find students who were in the 'AS' college in at least one semester. Students can be in different colleges depending on the semester. – cmorse May 08 '13 at 16:38
I ran this one. It performed about as well as the original query. I posted the EXPLAIN ANALYZE here: http://pastebin.com/u4fneiQT – cmorse May 10 '13 at 22:43

Andrew Lazarus · Answer 3 · 2013-05-11T01:34:07.280

You want, it appears, students who entered in 2006 and who have ever been in AS college.

Version One.

SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entry_year=2006
     AND student_id IN (SELECT student_id 
                        FROM student_semester s2 WHERE s2.college='AS')
     AND /* other criteria */
ORDER BY sem.student_id, semester;

Version Two

SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entry_year=2006
     AND EXISTS 
         (SELECT 1 FROM student_semester s2 
          WHERE s2.student_id = s.student_id AND s2.college='AS')
          -- CREATE INDEX foo on student_semester(student_id, college);
     AND /* other criteria */
ORDER BY sem.student_id, semester;

I expect both to be fast, but whether they one performs better than the other (or exact same plan) is a PG mystery.

[EDIT] Here's a version with no semi-joins. I wouldn't expect it to work well because it will give multiple hits for each time a student was in AS.

SELECT DISTINCT ON ( /* PK of sem */ )
FROM student s 
   JOIN student_semester sem USING (student_id) 
   JOIN student_semester s2  USING (student_id)
WHERE s.entry_year=2006
   AND s2.college='AS'
ORDER BY sem.student_id, semester;

Neither of these actually performed better than the original query. Here are the query plans. Version 1: http://pastebin.com/zXafx0ct, Version two: http://pastebin.com/vntd96dU — cmorse, May 10 '13 at 22:54
That's rather disappointing. I have one other possibility up added in edit. And BTW what are the indexes on `student_semester`? — Andrew Lazarus, May 11 '13 at 01:26

Improving Subquery performance in Postgres

3 Answers3