39

Ultimately I want to have a scalable search solution for the data in PostgreSQL. My findings point me towards using Logstash to ship write events from Postgres to Elasticsearch, however I have not found a usable solution. The solutions I have found involve using the jdbc input to query all data from Postgres on an interval, and the delete events are not captured.

I think this is a common use case, so I hope you can share your experience or give me some pointers on how to proceed.

Khanetor
    This answer might help (towards the end where I mention the Go/Python binlog solutions): http://stackoverflow.com/questions/34477095/elasticsearch-replication-of-other-system-data/34477639#34477639 – Val Mar 05 '16 at 12:44
  • Thanks Val for the pointer. I found that post prior to asking this question. However I think now I can use the jdbc input and do a little extra work to handle deleted data. Do you know how to set up Logstash to capture events from Postgres? – Khanetor Mar 05 '16 at 16:33
  • Uhm... I think that to work around the DELETE issue, I would not ever delete any record in my main database, and simply mark the records as `deleted`, and it will be reflected in Elasticsearch. When I feel like deleting stuff, I set a schedule to delete everything with the deleted mark from both the main database and ES. – Khanetor Mar 05 '16 at 16:47
  • Yep, I was going to suggest the latter approach as well, just mark as deleted and update your timestamp, and the jdbc input will take care of it. – Val Mar 05 '16 at 16:50
  • I would be happy to mark this as an answer if you are willing to compose it :) – Khanetor Mar 05 '16 at 17:21

3 Answers

20

If you need to also be notified of DELETEs and delete the respective records in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution that works off the database's transaction log (binlog), as suggested here.

However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch and you can exclude soft-deleted records from your searches with a simple term query on the `deleted` field.
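
For instance, here is a minimal Logstash pipeline sketch along those lines. The connection details, table and column names (books, deleted, updated_at) are made-up placeholders, and it assumes updated_at is bumped on every change, including the soft deletes:

input {
  jdbc {
    # Placeholder connection details – adjust for your environment
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    jdbc_driver_library => "/path/to/postgresql-jdbc.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "* * * * *"
    # Only fetch rows changed since the last run; soft-deleted rows are
    # included because the UPDATE that sets deleted = true also bumps updated_at
    statement => "SELECT id, title, deleted, updated_at FROM books WHERE updated_at > :sql_last_value ORDER BY updated_at"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "books"
    # Reuse the primary key so later runs overwrite the same document
    document_id => "%{id}"
  }
}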

Whenever you need to perform some cleanup, you can delete all records flagged deleted in both PostgreSQL and Elasticsearch.
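
The cleanup itself then boils down to two delete operations, e.g. (using the made-up books table and index from the sketch above):

DELETE FROM books WHERE deleted = true;

POST books/_delete_by_query
{
  "query": {
    "term": { "deleted": true }
  }
}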

Val
15

You can also take a look at PGSync.

It's similar to Debezium but a lot easier to get up and running.

PGSync is a change data capture (CDC) tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source of truth and expose structured, denormalized documents in Elasticsearch.

You simply define a JSON schema describing the structure of the data in Elasticsearch.

Here is an example schema (you can also have nested objects):

{
    "nodes": {
        "table": "book",
        "columns": [
            "isbn",
            "title",
            "description"
        ]
    }
}

PGSync generates queries for your document on the fly, so there is no need to write queries as you would with Logstash. It also supports and tracks deletion operations.

It operates in both a polling and an event-driven mode: polling captures changes made up to a point in time, and event notifications capture changes as they occur. The initial sync polls the database for changes since the last time the daemon was run; thereafter, event notifications (based on triggers and handled by pg_notify) pick up changes to the database.

It has very little development overhead.

  • Create a schema as described above
  • Point pgsync at your Postgres database and Elasticsearch cluster
  • Start the daemon (a rough sketch of these steps is shown below).
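
For example (the environment variable and command names below are taken from the PGSync README as I recall it, so verify them against the repo; hosts and credentials are placeholders):

# Tell PGSync where Postgres, Elasticsearch and Redis live
# (PGSync uses Redis to queue change events)
export PG_HOST=localhost PG_USER=postgres PG_PASSWORD=secret
export ELASTICSEARCH_HOST=localhost ELASTICSEARCH_PORT=9200
export REDIS_HOST=localhost

# One-time setup (installs the triggers PGSync needs), then run the daemon
bootstrap --config schema.json
pgsync --config schema.json --daemon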

You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.
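
For example, a book document with its authors nested inside could be described with a schema along these lines; the children and relationship keys are based on the PGSync documentation as I remember it, so treat them as an assumption and verify against the repo:

{
    "nodes": {
        "table": "book",
        "columns": [
            "isbn",
            "title"
        ],
        "children": [
            {
                "table": "author",
                "columns": [
                    "name"
                ],
                "relationship": {
                    "variant": "object",
                    "type": "one_to_many"
                }
            }
        ]
    }
}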

Have a look at the github repo for more details.

You can install the package from PyPI:
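
pip install pgsync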

taina
  • Hi, and welcome! Do you think you could expand on this a little bit, with some more info on what PGSync does and how it would help solve the original question? The best answers have minimal reliance on external links for support. – Z4-tier Jan 18 '20 at 18:23
  • PGSync is a great piece of kit. It works well. You can even specify child tables/columns and relationships and have them merge into a single index. – Damien Roche Oct 14 '20 at 17:58
  • PGSync is not working that well for us. When we create a new column in an existing table with data, it doesn't migrate data to ES. Should we look for alternatives then? – Khushal Vyas Jan 31 '23 at 09:32
  • @KhushalVyas Can you share more details on this or open a GitHub issue? Normally, when you add a new column to the schema.json you should ideally re-index. – taina Feb 01 '23 at 11:15
7

Please take a look at Debezium. It's a change data capture (CDC) platform which allows you to stream your data changes out of Postgres.
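
In practice this usually means running Debezium's Postgres connector on Kafka Connect and letting a sink connector (e.g. the Elasticsearch sink) write the change events into Elasticsearch. Here is a rough sketch of the connector registration JSON; host names, credentials and table names are placeholders, and some property names differ between Debezium versions (e.g. topic.prefix vs. database.server.name):

{
  "name": "mydb-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "secret",
    "database.dbname": "mydb",
    "topic.prefix": "mydb",
    "table.include.list": "public.books"
  }
}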

I created a simple GitHub repository which shows how it works.


Yegor