
I'm stuck with a query (Postgres 9.4.6) that a) uses too much memory (most likely due to array_agg()) and b) does not return exactly what I need, making post-processing necessary. Any input (especially regarding memory consumption) is highly appreciated.

Explanation: the table token_groups holds all words used in tweets I've parsed, with their respective occurrence frequency in an hstore, one row per 10 minutes (for the last 7 days, so 7*24*6 rows in total). These rows are inserted in order of tweeted_at, so I can simply order by id. I'm using row_number() to identify when a word occurred.

# \d token_groups
                                     Table "public.token_groups"
   Column   |            Type             |                         Modifiers
------------+-----------------------------+-----------------------------------------------------------
 id         | integer                     | not null default nextval('token_groups_id_seq'::regclass)
 tweeted_at | timestamp without time zone | not null
 tokens     | hstore                      | not null default ''::hstore
Indexes:
    "token_groups_pkey" PRIMARY KEY, btree (id)
    "index_token_groups_on_tweeted_at" btree (tweeted_at)

What I'd ideally want is a list of words, each with the relative distances of their row numbers. So if, e.g., the word 'hello' appears in row 5 once, in row 8 twice and in row 20 once, I'd want a column with the word and an array column returning {5,3,0,12} (meaning: first occurrence in the fifth row, next occurrence 3 rows later, next occurrence 0 rows later, next 12 rows later). If anyone wonders why: 'relevant' words occur in clusters, so (simplified) the higher the standard deviation of the distances in time, the more likely a word is a keyword. See more here: http://bioinfo2.ugr.es/Publicaciones/PRE09.pdf

For now, I return an array with positions and an array with frequencies, and use this info to calculate the distances in ruby.

Currently the primary problem is a high memory spike, which seems to be caused by array_agg(). The (very helpful) Heroku staff tell me that some of my connections use 500-700MB with very little shared memory, causing out-of-memory errors (I'm running Standard-0, which gives me 1GB total for all connections), so I need to find an optimization.

The total number of hstore entries is ~100k, which is then aggregated (after skipping words with very low frequency):

SELECT COUNT(*)
FROM (SELECT row_number() over(ORDER BY id ASC) AS position,
            (each(tokens)).key, (each(tokens)).value::integer
      FROM   token_groups) subquery;

 count
--------
 106632

Here is the query causing the memory load:

SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM (
  SELECT row_number() over(ORDER BY id ASC) AS pos, 
         (each(tokens)).key, 
         (each(tokens)).value::integer 
  FROM token_groups
  ) subquery 
GROUP BY key
HAVING SUM(value) > 10;

The output is:

     key     |                        positions                        |           frequencies
-------------+---------------------------------------------------------+-------------------------------
 hello       | {172,185,188,210,349,427,434,467,479}                   | {1,2,1,1,2,1,2,1,4}
 world       | {166,218,265,343,415,431,436,493}                       | {1,1,2,1,2,1,2,1}
 some        | {35,65,101,180,193,198,223,227,420,424,427,428,439,444} | {1,1,1,1,1,1,1,2,1,1,1,1,1,1}
 other       | {77,111,233,416,421,494}                                | {1,1,4,1,2,2}
 word        | {170,179,182,184,185,186,187,188,189,190,196}           | {3,1,1,2,1,1,1,2,5,3,1}
(...)

Here's what explain says:

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=12789.00..12792.50 rows=200 width=44) (actual time=309.692..343.064 rows=2341 loops=1)
   Output: ((each(token_groups.tokens)).key), array_agg((row_number() OVER (?))), array_agg((((each(token_groups.tokens)).value)::integer))
   Group Key: (each(token_groups.tokens)).key
   Filter: (sum((((each(token_groups.tokens)).value)::integer)) > 10)
   Rows Removed by Filter: 33986
   Buffers: shared hit=2176
   ->  WindowAgg  (cost=177.66..2709.00 rows=504000 width=384) (actual time=0.947..108.157 rows=106632 loops=1)
         Output: row_number() OVER (?), (each(token_groups.tokens)).key, ((each(token_groups.tokens)).value)::integer, token_groups.id
         Buffers: shared hit=2176
         ->  Sort  (cost=177.66..178.92 rows=504 width=384) (actual time=0.910..1.119 rows=504 loops=1)
               Output: token_groups.id, token_groups.tokens
               Sort Key: token_groups.id
               Sort Method: quicksort  Memory: 305kB
               Buffers: shared hit=150
               ->  Seq Scan on public.token_groups  (cost=0.00..155.04 rows=504 width=384) (actual time=0.013..0.505 rows=504 loops=1)
                     Output: token_groups.id, token_groups.tokens
                     Buffers: shared hit=150
 Planning time: 0.229 ms
 Execution time: 570.534 ms

PS: in case anyone wonders: every 10 minutes I append new data to the token_groups table and remove outdated data, which is easy when storing one row per 10 minutes. I still have to come up with a better data structure that e.g. uses one row per word, but that does not seem to be the main issue; I think it's the array aggregation.
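
Roughly, the rolling-window maintenance looks like this (a simplified sketch with an invented hstore value, not my exact statements):

-- executed every 10 minutes
DELETE FROM token_groups
WHERE  tweeted_at < now() - interval '7 days';   -- drop outdated rows

INSERT INTO token_groups (tweeted_at, tokens)
VALUES (now(), 'hello=>2, world=>1'::hstore);    -- append the new 10-minute bucket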

maia
  • array_agg is memory aggressive, more so in combination with hashagg. You need to a) increase available RAM (probably not possible on Heroku), or b) try disabling hashagg with "set enable_hashagg to off" – Pavel Stehule Feb 20 '16 at 03:35
  • Why version 9.4.2? [The latest point release is currently 9.4.6.](http://www.postgresql.org/support/versioning/) `We always recommend that all users run the latest available minor release for whatever major version is in use.` And: `one row per 10 minutes` would thus be `7*24*6` rows total. And: is `pos` supposed to be the count of rows or the count of hstore keys within one row? Or a running count over all? – Erwin Brandstetter Feb 20 '16 at 03:41
  • @Pavel, thanks. I cannot increase RAM. As for disabling the query planner's use of hashed aggregation plan types: it does help on my development box, but before I run this on my server: doesn't that have lots of possible side effects I'd need to consider? – maia Feb 20 '16 at 09:48
  • @Erwin, thanks, and you're correct, I'm sorry: I am using 9.4.6, and the total is `7*24*6` (it was late, we share the same time zone). As for `pos`, it is the row number of the `token_groups` table, meaning it will increment once for every 10 minutes. The (simplified) concept is basically: what's the standard deviation of occurrences in tweets grouped by 10 minutes for each used word? If it's high, it's probably a keyword. – maia Feb 20 '16 at 09:48
  • @PavelStehule on my local development installation, disabling hashagg drastically reduces memory – instead of spiking at >400MB, I see no major increase in memory at all. On the other hand the query is 10x as slow – which is fine with me as I only run this query once every 10 minutes, but I'm unsure how many other queries would be affected negatively. Is there a way to disable this query plan only for this query, and/or, now that I know the culprit, to optimize the query or find out if it's a bug that should be submitted? Or is there a config setting I could change that would help? Thanks! – maia Feb 20 '16 at 11:37
  • @maia It should be disabled only for this query – the slowdown is expected :( – hashagg is usually faster, but more memory expensive; that is normal behaviour. The SET statement has only session impact, and it is a relatively usual way of solving problems with the planner. You can try the new 9.5; there are some optimizations of array_agg. It is still memory expensive, but less so. – Pavel Stehule Feb 20 '16 at 16:12
  • @PavelStehule thanks, this helps. I've not found a lot about disabling specific query plans yet, and the solution suggested by Erwin does not make it necessary, but it's good to know that this is always an option. – maia Feb 21 '16 at 14:31
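
To make Pavel's suggestion from the comments concrete, here is a minimal sketch of the session-scoped toggle (assuming the aggregate query from above; SET/RESET only affect the current connection):

SET enable_hashagg = off;   -- only this session is affected

-- ... run the memory-hungry aggregate query here ...

RESET enable_hashagg;       -- restore the default for the rest of the session

Inside a transaction, SET LOCAL enable_hashagg = off reverts automatically at commit/rollback.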

1 Answer


The query you presented can be simpler, evaluating each() only once per row:

SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM  (
   SELECT t.key, pos, t.value::int
   FROM  (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) tg
        , each(tg.tokens) t  -- implicit LATERAL join
   ORDER  BY t.key, pos
   ) sub
GROUP  BY key
HAVING sum(value) > 10;

It also preserves the correct order of array elements.

What I'd ideally want is a list of words, each with the relative distances of their row numbers.

This would do it:

SELECT key, array_agg(step) AS occurrences
FROM  (
   SELECT key, CASE WHEN g = 1 THEN pos - last_pos ELSE 0 END AS step
   FROM  (
      SELECT key, value::int, pos
           , lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos) AS last_pos
      FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) tg
           , each(tg.tokens) t
      ) t1
      , generate_series(1, t1.value) g
   ORDER  BY key, pos, g
   ) sub
GROUP  BY key
HAVING count(*) > 10;
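
Applied to the 'hello' row from the sample output in the question (positions {172,185,188,210,349,427,434,467,479}, frequencies {1,2,1,1,2,1,2,1,4}), this should yield something like:

  key  |               occurrences
-------+------------------------------------------
 hello | {172,13,0,3,22,139,0,78,7,0,33,12,0,0,0}

The first element is the position of the first occurrence, then the distance to each following occurrence, with 0 for repeated occurrences within the same row.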

SQL Fiddle.

Interpreting each hstore key as a word and the respective value as the number of occurrences in that row (= for the last 10 minutes), I use two cascading LATERAL joins: the 1st step decomposes the hstore value, the 2nd step multiplies rows according to value. (If your value (frequency) is mostly just 1, you can simplify.)
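
A minimal sketch of the two LATERAL steps in isolation (the VALUES rows are invented just for illustration):

-- step 1: one row per hstore key/value pair (implicit LATERAL over each())
SELECT tg.id, t.key, t.value::int
FROM   token_groups tg, each(tg.tokens) t;

-- step 2: repeat each row "value" times (implicit LATERAL over generate_series())
SELECT t.key, g
FROM   (VALUES ('hello'::text, 2), ('world', 1)) AS t(key, value)
     , generate_series(1, t.value) AS g;   -- 'hello' appears twice, 'world' once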

Then I ORDER BY key, pos, g in the subquery before aggregating in the outer SELECT. The clause may seem redundant, and in fact I see the same result without it in my tests. That's a collateral benefit from the window definition of lag() in the inner query, whose ordering is carried over to the next step unless any other step triggers re-ordering. Without the explicit ORDER BY, however, we would depend on an implementation detail that's not guaranteed to hold.

Ordering the whole query once should be substantially faster (and easier on the required sort memory) than per-aggregate sorting. This is not strictly according to the SQL standard either, but the simple case is documented for Postgres:

Alternatively, supplying the input values from a sorted subquery will usually work. For example:

SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

But this syntax is not allowed in the SQL standard, and is not portable to other database systems.

Strictly speaking, we only need:

ORDER BY pos, g

You could experiment with that.

Possible alternative:

SELECT key
     , ('{' || string_agg(step || repeat(',0', value - 1), ',') || '}')::int[] AS occurrences
FROM (
   SELECT key, pos, value::int
        ,(pos - lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos))::text AS step
   FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) g
        , each(g.tokens) t
   ORDER  BY key, pos
   ) t1
GROUP  BY key;
-- HAVING sum(value) > 10;

It might be cheaper to use text concatenation instead of generate_series().
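
A minimal illustration of what the string aggregation builds per word (the numbers are invented; in the full query the element order is fixed by the subquery's ORDER BY):

-- step='172'/value=1 → '172', step='13'/value=2 → '13,0', step='3'/value=1 → '3'
SELECT ('{' || string_agg(step || repeat(',0', value - 1), ',') || '}')::int[] AS occurrences
FROM  (VALUES ('172', 1), ('13', 2), ('3', 1)) AS x(step, value);
-- → {172,13,0,3}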

Erwin Brandstetter
  • Thanks a lot. The query returned a `syntax error at or near "*"`, but moving the `*` to before row_number() solved the issue. Optimizing speed is certainly helpful; now I need to find a way to reduce the load of `array_agg()`, especially if it's run twice (once for positions, once for frequencies). If I could run it only once (one array with distances between occurrences, especially as that's what I need after all), it might reduce the memory spike. In case you have any suggestions on that, I'd be very thankful. – maia Feb 20 '16 at 11:07
  • btw: @Erwin you can't imagine in how many cases I've been searching for Postgres issues in the last months and found one of your answers here to be very helpful. Lots of what I've learned about Postgres, I've learned from your activity here. So many thanks for that! – maia Feb 20 '16 at 11:08
  • Oh, I changed the query to `array_agg(t.val::int)` as I need an array of integers. I'm unsure if it's now really faster (and/or less memory intensive) as I will now cast to integer twice. – maia Feb 20 '16 at 11:25
  • @maia: There was a typo in my first draft, but never mind that. Consider the new answer. – Erwin Brandstetter Feb 20 '16 at 13:21
  • Thanks a lot, that helps and also solves the memory issue! I will need some time to understand how you did it exactly. I now have two issues to figure out: one is the comparably slow speed ([explain.depesz.com](http://explain.depesz.com/s/xQvN)), the other (which I didn't mention previously) is that I'd also need the distance to the max row_number. I assume I will be able to figure out the latter sooner or later, but as for the first, if you have any additional suggestions, I'm looking forward to them. But for now: thanks for all the time you've invested! – maia Feb 20 '16 at 13:46
  • Am I correct that the line `ORDER BY key, pos, g` is redundant? It's causing a massive sort (increasing total query time to ~300%), and does not seem to change the output at all. PS: I was able to figure out neither how to simplify when `value` (frequency) is mostly 1, nor how to append the distance from the last occurrence to the count of rows +1 for each word. I've tried generating a temp table and running a `union all` of the query with `lag` and also with `lead`, but I failed (too slow). I guess this is something I need to extract into a simplified question and repost here. – maia Feb 21 '16 at 14:29
  • @maia: I added more explanation and links for `ORDER BY` and an alternative query. For the additional question about "distance of the last occurrence" start a new question please, this one is too convoluted already. It's a QA site. You can always link to this one for context. – Erwin Brandstetter Feb 23 '16 at 01:26