
I have a table like this:

CREATE TABLE cache (
  id BIGSERIAL PRIMARY KEY,
  source char(2) NOT NULL,
  target char(2) NOT NULL,
  q TEXT NOT NULL,
  result TEXT,
  profile TEXT NOT NULL DEFAULT '',
  created TIMESTAMP NOT NULL DEFAULT now(),
  api_engine text NOT NULL,
  encoded TEXT NOT NULL
);

I want to pass over the list of `encoded` fields (maybe with OVER ... WINDOW?) with something like:

SELECT id, string_agg(encoded, '&q=') FROM cache

so I will have the list of corresponding ids, and a string of the concatenated `encoded` fields: '&q=encoded1&q=encoded2&q=encoded3' ..., with a total length not exceeding some limit (like not more than 2000 chars).

The second condition: I want to go to the next window whenever one of the fields source, target or profile changes.

Is that possible with an SQL SELECT in a FOR loop?

I know how to do that with PL/pgSQL / PL/Python / PL/Perl, but I want to optimize this query.

FOR rec IN
  SELECT array_agg(id) AS ids, string_agg(encoded, '&q=') AS url FROM cache
  WHERE result IS NULL
  ORDER BY source, target
LOOP
  -- here I call curl with that *url*
END LOOP;

Example data:

INSERT INTO cache (id, source, target, q, result, profile, api_engine, encoded) VALUES
   (1, 'ru', 'en', 'Длинная фраза по-русски'            , NULL, '', 'google', '%D0%94%D0%BB%D0%B8%D0%BD%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8')
 , (2, 'ru', 'es', 'Ещё одна непонятная фраза по-русски', NULL, '', 'google', '%D0%95%D1%89%D1%91+%D0%BE%D0%B4%D0%BD%D0%B0+%D0%BD%D0%B5%D0%BF%D0%BE%D0%BD%D1%8F%D1%82%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8')
-- etc...

and so on, 100500 rows like that. The fields source and target can be different language codes, and they repeat, so I probably need to GROUP BY source, target, profile.
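Without the length limit, plain per-group aggregation would already be enough; the limit is what makes this hard. Just for reference, the simple version I start from:

```sql
-- One url per (source, target, profile) group, ignoring the 2000-char
-- limit for now; ids and encoded fields are kept in the same order.
SELECT array_agg(id ORDER BY id)              AS ids
     , string_agg(encoded, '&q=' ORDER BY id) AS url
FROM   cache
WHERE  result IS NULL
GROUP  BY source, target, profile;
```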

I want to SELECT the first N rows, where the concatenation of the `encoded` fields with a delimiter like `&q=` produces a string such as

&q=%D0%94%D0%BB%D0%B8%D0%BD%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8&q=%D0%95%D1%89%D1%91+%D0%BE%D0%B4%D0%BD%D0%B0+%D0%BD%D0%B5%D0%BF%D0%BE%D0%BD%D1%8F%D1%82%D0%BD%D0%B0%D1%8F+%D1%84%D1%80%D0%B0%D0%B7%D0%B0+%D0%BF%D0%BE-%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8

so that the length of this concatenated string is not more than (say) 2000 chars. So I will have that string, and also all ids of those rows included in the url (in the same order, of course).

And then I want to select the next N rows by the same criteria, and so on.

Erwin Brandstetter
Dimitri
  • Can you edit your question and provide sample data and desired results? – Gordon Linoff Apr 25 '17 at 14:32
  • What should happen if the length exceeds 2000 characters? – Laurenz Albe Apr 25 '17 at 14:44
  • if the length exceeds quota, I push result into function, which will send/receive slow answers from http API, and start over with the next part of my table. – Dimitri Apr 25 '17 at 15:09
  • I am sorry, I was inaccurate in definition. Sure that request will go inside FOR IN SELECT LOOP. – Dimitri Apr 25 '17 at 15:14
  • `encoded` is not defined NOT NULL? Where does the leading `http://google.translation-api.com/api/v2&q=` come from? Does it count against the 2000 char limit? What's your version of Postgres? – Erwin Brandstetter Apr 25 '17 at 15:28
  • Sorry Dimitri, but your question isn't clear. Try to focus on the main problem you need help with. [**How to create a Minimal, Complete, and Verifiable example**](http://stackoverflow.com/help/mcve) If the problem is just the length, try something simple ... instead of using strings with lots of `%%%` or Russian characters. Include sample data and expected output. – Juan Carlos Oropeza Apr 25 '17 at 15:28
  • In my question I already have sample data. The expected output is exactly this long url with lots of %, and the list of ids. Juan Carlos, did you read my full question, or just look at the end? About the char limit: that does not matter exactly; I know exactly the length of the static part of the url. – Dimitri Apr 25 '17 at 15:38
  • Erwin, the *encoded* field is always a URL-encoded version of the *q* field. There is a trigger under the hood which encodes it. You can omit the leading part of the url, I wrote it just for clarity about how this will be used. Postgres version is 9.6.1 – Dimitri Apr 25 '17 at 15:45
  • I tried reading it several times but don't see the problem. Again, if you simplify the problem it is easier to understand and solve. Check this [**question**](http://stackoverflow.com/questions/43331682/add-a-semi-sequential-number-to-rows-based-on-group-by-and-id-then-mark-newes/43331974#43331974) for example. I'm sure the OP's problem isn't about Cats or Apples, but they have a data table in a readable format and another table with the desired output. Then check my answer. Using those tables I created a small working [**DEMO**](http://rextester.com/ETAP80996) to show the solution. But I can't do that without sample data. – Juan Carlos Oropeza Apr 25 '17 at 15:52
  • I took the liberty to clarify your table & data somewhat accordingly. – Erwin Brandstetter Apr 25 '17 at 16:19

1 Answer


You can do it with a smart recursive CTE:

WITH RECURSIVE c AS ( -- 1st CTE is not recursive
   SELECT dense_rank()  OVER (ORDER BY     source, target, profile)             AS rnk
        , row_number()  OVER (PARTITION BY source, target, profile ORDER BY id) AS rn
        , lead(encoded) OVER (PARTITION BY source, target, profile ORDER BY id) AS next_enc
        , id, encoded
   FROM   cache
   )

 , rcte AS (  -- "recursion" starts here
   SELECT rnk, rn, ARRAY[id] AS ids, encoded AS url
        , CASE WHEN length(concat_ws('&q=', encoded, next_enc)) > 2000  -- max len
                 OR next_enc IS NULL  -- last in partition
               THEN TRUE END AS print
   FROM   c
   WHERE  rn = 1

   UNION ALL
   SELECT c.rnk, c.rn
        , CASE WHEN r.print THEN ARRAY[c.id] ELSE r.ids || c.id                    END AS ids
        , CASE WHEN r.print THEN c.encoded ELSE concat_ws('&q=', r.url, c.encoded) END AS url
        , CASE WHEN length(
             CASE WHEN r.print THEN concat_ws('&q=', c.encoded, c.next_enc)
                  ELSE concat_ws('&q=', r.url, c.encoded, c.next_enc) END) > 2000  -- max len
                 OR c.next_enc IS NULL  -- last in partition
               THEN TRUE END AS print
   FROM   rcte r
   JOIN        c USING (rnk)
   WHERE  c.rn = r.rn + 1
   )
SELECT ids, url
FROM   rcte
WHERE  print
ORDER  BY rnk, rn;

About the rCTE including a non-recursive CTE:

But this is probably one of the rare cases where looping in a plpgsql function is actually faster.

See this related answer for more explanation:

Erwin Brandstetter
  • Thank you, Erwin, I had never heard about CTEs. It looks much more complex than a plain loop. Also, your query returned duplicates. ids, url: {1,2,3,4,18,19,21,22,23,25,37}, '%%%here long string'; {1,2,3,4,18,19,21,22,23,25,37,38}, '%%%previous string + some data'; {1,2,3,4,18,19,21,22,23,25,37,38,39}, '%%%and so on'; – Dimitri Apr 25 '17 at 16:40
  • @Dimitri: Sorry, I was in a hurry and forgot to restart aggregates after printing. Fixed now. Anyway, this is merely a proof of concept. I am almost certain that looping over the tables is faster and simpler for this particular case. – Erwin Brandstetter Apr 26 '17 at 04:26