8

Lets say I have a book represented like this:

{ title: "Tis no book"
  chapter: 1,
  text: "Hello world this is a book chapter",
  subchapters: [
     {
     chapter: 1.1
     text: "Nested sub chapter"
     subchapters: [
       chapter: 1.1.1
       text: "Nested nested..."
       subchapters: [ ...etc...]
     },
     {
     chapter: 1.2
     text: "Nested sub chapter 2"
     subchapters: [ ...etc...]
     }
   ]
}

Can I use postgres' 9.4 new jsonb/gin (or anything else) to set up a full text index on the "text" field of this recursively nested data structure such that a library of books stored to the database can be searched by text (using the index)?

gremwell
  • 1,419
  • 17
  • 23

3 Answers3

4

I have just started looking into full-text search and jsonb types. It looks like this is possible once you understand how the indexes work on JSONB types. I have found this blog series to be very helpful.

https://bibhas.in/blog/postgresql-swag-part-2-indexing-json-data-type-and-full-text-search/

Also, the documentation on the JSON type at Postgres contains some good insights. http://www.postgresql.org/docs/9.4/static/datatype-json.html

  • Unless I'm mistaken, this only shows how to do a fulltext search on a specific key-value (i.e. a specific depth in the tree). I think the problem needing answered for the OP (and why I found this question) is the recursive nested part. – claytond Apr 19 '18 at 15:19
2

One option -- and likely the best -- is to create a recursive function (official documentation on functions) that combines the text values of the entire JSONB structure into a single string (which it returns). You can full-text index the output of that function (see @thomas-wayne-shelton's answer) as if it were just a string. So long as you use the same function in your WHERE conditions, pgSQL should recognize the index.

I just built/tested a similar "flattening" function so I'm confident that it can be done. Unfortunately, my case was very different so I suspect the code would confuse rather than enlighten. I can say that the function(s) must be marked as IMMUTABLE to support indexing.

===

The other path that looks promising (but I believe a dead-end) is a recursive CTE.

It is certainly possible to unwrap the recursive data using a recursive CTE. Here's the official documentation, a SO answer, and a blog example -- the last two for recursive nested JSONB specifically.

However, I don't believe you can index this output. It's my (tentative) understanding that a query using a recursive CTE is a lot like a view. It's a runtime optimization. I no longer have reference on-hand but recall seeing recent (as of Apr. 2018) discussion that indexing views was still a long way off (and I'm not even sure an index on a view would actually work for a CTE).

claytond
  • 1,061
  • 9
  • 22
-1

I won't answer the question; instead, I'm going to suggest a completely different approach.

Have you taken a look at Lucene, https://lucene.apache.org/core/? Implementing full-text search in the database is bad design. You should be using a separate full-text indexer such as Lucene. The Lucene documents should refer to the database keys which can be used to look up the real record in the database.

By using Lucene, you will most likely get much better performance than by using the full-text search features of the database. Furthermore, Lucene is far easier to scale out than your database.

juhist
  • 4,210
  • 16
  • 33
  • 1
    Yeh for sure it is definitely the solution that first comes to mind, but I was reading some articles suggesting that full text indexing in postgres is now quick (https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf) and wondering if this is at all possible. But thanks for the reality check. – gremwell Mar 23 '15 at 12:00
  • 2
    Well, if you decide to use the full text indexing of PostgreSQL, good luck in the path you choose! One solution would be to concatenate the "text" fields into a separate column and use full-text indexing on that column. In results in some data duplication, but hey, disk space is cheap these days. – juhist Mar 23 '15 at 12:02