
I have a table with millions of text documents in a Postgres 14.7 DB. When I try to generate a GIN index I get this error:

create index idx__txt_document__doc_text__gin on txt_document using gin(to_tsvector('simple', doc_text));
ERROR:  string is too long for tsvector (4040510 bytes, max 1048575 bytes)
Time: 6221.251 ms (00:06.221)

Is there a way to find out the offending text id?

I get the same error when I create a temp table that contains the text id and the length of each document's tsvector.

create temp table tmp_foo as
select id, length(to_tsvector('simple', doc_text))
from txt_document;
ERROR:  string is too long for tsvector (4040510 bytes, max 1048575 bytes)
CONTEXT:  parallel worker
Time: 1912.090 ms (00:01.912)

Any idea how to get the text that creates the error?

I found the largest document, which is about 7 MB, but creating the tsvector for it works just fine.

select length(to_tsvector('simple', doc_text)) from txt_document where id = ID_LARGEST_TEXT;
chhenning
  • can you supply some data with your table? – nbk Aug 29 '23 at 20:33
  • It is plain text generated from HTML files. Does that answer your question? – chhenning Aug 29 '23 at 20:37
  • no, to reproduce the problem it needs a [mre], so some rows are needed, as the problem is the text in the table – nbk Aug 29 '23 at 20:39
  • That is what I'm asking. ;-) Is there a way to find the text that is throwing the error? For instance, I have randomly selected 100,000 documents and I was able to generate the index. – chhenning Aug 29 '23 at 20:43
  • how about not using ts_vector when searching for length, but only the text – nbk Aug 29 '23 at 20:47
  • Not sure what you mean. "Searching for length" of what? – chhenning Aug 29 '23 at 20:57
  • i mean length('simple', doc_text) to check for values that have more than 1,800,000 bytes – nbk Aug 29 '23 at 21:23
  • I don't think there is a very good way of doing that. I think you would need to do something like a binary search. – jjanes Aug 29 '23 at 22:03

2 Answers


You can loop through your table, catch the exception, and RAISE a notice (or whatever):

DO
$do$
DECLARE
   r record;
BEGIN
   FOR r IN
      SELECT id, doc_text FROM tbl
   LOOP
      PERFORM to_tsvector('simple', r.doc_text);
   END LOOP;
   
   -- Force error for debugging:
   -- PERFORM to_tsvector('simple', string_agg('longwordnr' || g, ' ')) FROM generate_series (1, 100000) g;  
EXCEPTION
   WHEN program_limit_exceeded THEN
-- WHEN SQLSTATE '54000' THEN  -- the same with error code
      RAISE NOTICE 'Row with this "id" exceeds ts_vector length: %', r.id;
END
$do$;

Using a DO command to execute a PL/pgSQL code block quickly.
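
As written, the block reports only the first offending row, because the exception handler belongs to the outer block and ends the loop. If more than one row might exceed the limit, a variation with the handler inside the loop reports all of them. A sketch, with the same assumed table and column names:

DO
$do$
DECLARE
   r record;
BEGIN
   FOR r IN
      SELECT id, doc_text FROM tbl
   LOOP
      BEGIN  -- nested block, so the loop continues after an error
         PERFORM to_tsvector('simple', r.doc_text);
      EXCEPTION
         WHEN program_limit_exceeded THEN
            RAISE NOTICE 'Row with this "id" exceeds tsvector length: %', r.id;
      END;
   END LOOP;
END
$do$;

The nested block starts a subtransaction for every row, which adds some overhead, but that hardly matters for a one-off diagnostic run.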

I got the error code from provoking the same error. Postgres error messages by default add this line:

...
SQL state: 54000

Your client seems to suppress it, or you did not include it in the question.
About Postgres error codes.

About the EXCEPTION clause.

Note that duplicate lexemes in an input string are stored "compressed" in a tsvector: each lexeme appears only once, with its positions. Hence, the longest string does not necessarily produce the longest tsvector. Consider this demo:

fiddle
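
The linked fiddle has the full demo; a minimal sketch of the same point is below. One word repeated many times boils down to a single lexeme, while the same number of distinct words produces a much longer tsvector:

SELECT length(to_tsvector('simple', repeat('samelexeme ', 10000)))  AS repeated_word_lexemes   -- 1
     , length(to_tsvector('simple', string_agg('word' || g, ' ')))  AS distinct_word_lexemes   -- 10000
FROM   generate_series(1, 10000) g;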

Erwin Brandstetter

It is a bit annoying, but you could create a wrapper function which catches the error and converts it to NULL (or some other sentinel value):

create or replace function jjj_to(text) returns tsvector language plpgsql as $$
  begin
    return to_tsvector('simple', $1);
  exception when others then
    return null;  -- any error turns into NULL
  end $$;

Then:

select id from txt_document where jjj_to(doc_text) is null;
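
Note that when others also maps any unrelated error to NULL. If you want only this particular failure converted, the handler can be narrowed to its condition, program_limit_exceeded (SQLSTATE 54000, as established in the first answer). A sketch of that variant:

create or replace function jjj_to(text) returns tsvector language plpgsql as $$
  begin
    return to_tsvector('simple', $1);
  exception when program_limit_exceeded then  -- SQLSTATE 54000; anything else still propagates
    return null;
  end $$;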
jjanes