
I'm quite new to databases, and am looking for some high level advice.

The Situation
I'm building a database using Postgres 9.3; within the database is a table in which I store log files.

CREATE TABLE errorlogs (
     id SERIAL PRIMARY KEY,
     archive_id INTEGER NOT NULL REFERENCES archives,
     filename VARCHAR(256) NOT NULL,
     content TEXT);

The text in content can vary in length anywhere from 1 kB to 50 MB.

The Problem
I'd like to be able to perform reasonably fast text searches on the data within the "content" column (e.g., WHERE content LIKE '%some_error%'). Right now the searches are very slow (>10 minutes to search through 8206 rows).

I know that indexing is intended to be the solution to my problem, but I don't seem to be able to create indexes -- whenever I try, I get errors saying the index row would be too large.

=# CREATE INDEX error_logs_content_idx ON errorlogs (content text_pattern_ops);
ERROR: index row requires 1796232 bytes, maximum size is 8191

I was hoping for some advice on how to get around this problem. Can I change the maximum index size? Or should I not be trying to use Postgres for full text search on text fields as large as this?

Any advice is much appreciated!

JBeFat
  • I think you are probably looking for full-text search/indexing: http://www.postgresql.org/docs/9.1/static/textsearch-intro.html – John Powell Jan 15 '15 at 17:59
  • This answer might also help: http://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations/13452528#13452528 – John Powell Jan 15 '15 at 18:00
  • Hi John, thanks for the advice. I've already gone through the textsearch docs, and I couldn't find any information on the index limitations. The second comment you posted describes creating a text_pattern_ops index, which as I mentioned above, returns an error about the index being too large. – JBeFat Jan 15 '15 at 18:11
  • You want to be using a gin or gist index rather than a B-tree, which is what text_pattern_ops works on. Apologies for the bad link. – John Powell Jan 15 '15 at 18:21
  • I've tried that too; I get similar errors about the data being too large for a tsvector, using something like WHERE to_tsvector(content) @@ to_tsquery('Stacktrace') (with no index). – JBeFat Jan 15 '15 at 18:23
  • I get NOTICES, not ERRORS, when using to_tsvector on strings that contain very large words. Although if your content contains words over 2047 characters long, I have to wonder whether it is really binary data, which could mean it has \0 characters that will cause problems. – jjanes Jan 15 '15 at 22:12
  • The data I am indexing is definitely not binary. They are very large log files, and there is a chance there could be a "word" over 2047 characters long. Do you happen to have a link to information on these limits? It would be really helpful... – JBeFat Jan 16 '15 at 14:24
  • I don't know that the 2047 limit is documented anywhere. I am just going empirically off from what I see with `select to_tsvector('simple',(select string_agg(md5(x::text),'') from generate_series(1,1000) x(x))||' Stacktrace' );` – jjanes Jan 16 '15 at 22:01

1 Answer


Text search vectors can't handle data this big; see the limitations section of the text search documentation. Their strength is fuzzy searching: you can search for 'swim' and find 'swim', 'swimming', 'swam', and 'swum' in the same call. They are not meant to replace grep.
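
As a rough illustration of that normalization (assuming the default 'english' configuration; 'Stacktrace' is borrowed from the comments above):

SELECT to_tsvector('english', 'Stacktrace at line 42')
       @@ to_tsquery('english', 'stacktrace');
-- returns true: both sides normalize to the lexeme 'stacktrac'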

The reasons for the limits can be found in the source code as MAXSTRLEN (and MAXSTRPOS). Text search vectors are stored in one long, contiguous array of up to 1 MiB (the total of all characters for all unique lexemes). To access these, the tsvector index structure allows 11 bits for a word's length and 20 bits for its position in the array, so that the structure fits into a 32-bit unsigned int.
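
Spelling that out: 11 bits cap a single lexeme at 2^11 - 1 = 2047 bytes, which lines up with the ~2047-character threshold mentioned in the comments above, and 20 bits cap positions in the lexeme array at 2^20 = 1,048,576, i.e. the 1 MiB total.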

You are probably running into one or both of these limits if a file has too many unique words, or if words are repeated very frequently; both are quite possible with a 50 MB log file of quasi-random data.

Are you sure you need to store log files in a database? You're basically replicating the file system, and grep or Python can do the searching there quite nicely. If you really do need to, though, you might consider something like this:

CREATE TABLE errorlogs (
    id SERIAL PRIMARY KEY
    , archive_id INTEGER NOT NULL REFERENCES archives
    , filename VARCHAR(256) NOT NULL
);

CREATE TABLE log_lines (
    line SERIAL PRIMARY KEY
    , errorlog INTEGER REFERENCES errorlogs(id)
    , context TEXT
    , tsv TSVECTOR
);

CREATE INDEX log_lines_tsv_idx ON log_lines USING gin(tsv);

Here, you treat each log line as a "document." To search, you'd do something like

SELECT e.id, e.filename, g.line, g.context
FROM errorlogs e JOIN log_lines g ON e.id = g.errorlog 
WHERE g.tsv @@ to_tsquery('some & error');
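
One thing the sketch above leaves open is how tsv gets populated. A minimal option (the trigger name is arbitrary, and the 'english' configuration is an assumption) is the built-in tsvector_update_trigger():

CREATE TRIGGER log_lines_tsv_update
    BEFORE INSERT OR UPDATE ON log_lines
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', context);

With that in place, tsv is recomputed on every insert or update of a line; alternatively, you can fill the column yourself with to_tsvector('english', context) while loading the lines.
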
afs76
  • Thanks very much for the suggestion. I have since switched to storing a single log line per row. I hadn't yet tried indexing; I just tried this and it works nicely. Thanks again! – JBeFat Apr 26 '15 at 17:37