
I have a basic HA setup for Logstash: two identical nodes in two separate AWS availability zones. Each node runs a pipeline that extracts a dataset from a DB cluster and then outputs it downstream to an Elasticsearch cluster for indexing. This works fine with one Logstash node, but two nodes running in parallel send the same data twice to ES for indexing, because each node tracks :sql_last_value separately. Since I use the same ID as the document ID across both nodes, all repeated data is simply updated instead of being inserted twice. In other words, there is 1 insert and 1 update per dataset. This is obviously not very efficient and puts unnecessary load on ELK resources, and it gets worse as additional Logstash nodes are added.

Does anyone know a better way to set up parallel Logstash nodes, so that a node doesn't extract a dataset that has already been extracted by another node? One poor man's solution could be creating a shared NFS folder between the Logstash nodes and having each node write :sql_last_value there, but I am not sure what kind of side effects I might run into with this setup, especially under higher loads. Thank you!
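For reference, here is a stripped-down sketch of the pipeline each node currently runs (the connection string, table, and column names are placeholders, not my real setup):

input {
  jdbc {
    # connection details are placeholders
    jdbc_connection_string => "jdbc:postgresql://db-cluster:5432/mydb"
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "*/5 * * * *"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    # each node keeps its own copy of :sql_last_value in this file by default
    last_run_metadata_path => "/var/lib/logstash/.logstash_jdbc_last_run"
    statement => "SELECT * FROM my_table WHERE updated_at > :sql_last_value"
  }
}
output {
  elasticsearch {
    hosts => ["https://es-cluster:9200"]
    index => "my-index"
    # same document ID on every node, so the second node's copy becomes an update
    document_id => "%{id}"
  }
}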

demisx
  • Looks like you found the answer in a feature request from 2015: https://github.com/elastic/logstash/issues/2632 – Alain Collins Mar 11 '19 at 18:18
  • No, I can't say I found the answer to this. Still sending duplicate data to ES from each Logstash node. – demisx Mar 14 '19 at 18:22
  • I believe that the answer is that there's a 4-year old feature request that hasn't been addressed. – Alain Collins Mar 14 '19 at 23:02
  • I don't see how this qualifies as an answer. – demisx Mar 15 '19 at 03:52
  • It's not an answer - it's a comment. It does, however, serve as a pointer to the information that shows the current status of the answer and may someday - if you're a real optimist - show the solution. If anyone else finds your question here, they will have more information than they had without the link. – Alain Collins Mar 16 '19 at 17:49
  • Those threads sometimes go without any solution for years. I am hoping maybe someone has found a stable workaround for this issue and would be willing to share. – demisx Mar 17 '19 at 17:31

2 Answers


We have the very same scenario: 3 Logstash instances to ensure high availability, with several databases as data sources.

On each Logstash instance, install and enable the same JDBC pipelines following this logic:

  • find a unique identifier in your result set for each document (primary key etc.) or generate a fingerprint from the fields in the result (MD5, SHA, not UUID). This identifier needs to be stable! It has to be the same on the other Logstash nodes when the same entities are returned.
  • in the Elasticsearch output, use the ID or the fingerprint as the document _id.

Here is a simplified example for the easy case (the ID is part of the result set):

input{
  jdbc{
     ...
     statement => "select log_id, * from ..."
     ...
  }
}
filter{...}
output{
  elasticsearch{
    ...
    index => "logs-%{+YYYY.MM.dd}"
    document_id => "%{[log_id]}"
    ...
  }
}

And here is the variant for when your data lacks unique identifiers and you need to generate a fingerprint:

input{
  jdbc{
     ...
     statement => "select * from ..."
     ...
  }
}
filter{
  fingerprint {
    method => "MD5"
    concatenate_all_fields => true
  }
}
output{
  elasticsearch{
    ...
    index => "logs-%{+YYYY.MM.dd}"
    document_id => "%{[fingerprint]}"
    ...
  }
}

Either way, the documents will be created when they're part of the result set for one Logstash instance. All other Logstash instances will get the same documents at a later time. Using the ID/fingerprint as _id will update the previously created documents instead of duplicating your data.

Works well for us, give it a try!

ibexit
  • Thank you for your input. This is the same workaround we use right now. – demisx May 30 '19 at 04:01
  • I don't think this is a workaround but a working solution that fully solves the HA requirement :) Would you mind accepting this answer then, as it is a working solution you're also using? This will help others with the very same issue. Thx & Cheers – ibexit May 30 '19 at 16:28
  • You got it! Thank you for chiming in. Feel free to upvote the question in turn. – demisx May 30 '19 at 20:32
  • Just a follow-up question: does it mean Logstash doesn't have an official HA solution? – Nag Apr 01 '20 at 09:32
  • At least not built in, the same as for many other servers like Apache httpd. You need to make it HA by applying common architecture patterns. – ibexit Apr 01 '20 at 10:05
  • @Nag Not that I am aware of. The described "workaround" seems to be the only way right now. – demisx Apr 08 '20 at 14:44
  • @ibexit But it still looks inefficient, because all Logstash instances will read the same data from the input and send it repeatedly to Elasticsearch. – Lovin Jul 14 '22 at 07:15

I prefer having a common last_run_metadata_path (on NFS, or another shared file system) together with a certain offset in the schedule parameters of the different Logstash instances.

Please check the jdbc input plugin documentation for additional detail on last_run_metadata_path.
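For illustration, a rough sketch of what this could look like (the NFS mount point, query, and schedule values are just examples; adjust to your environment):

input {
  jdbc {
    ...
    # both instances point at the same state file on the shared mount
    last_run_metadata_path => "/mnt/nfs/logstash/.logstash_jdbc_last_run"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    statement => "SELECT * FROM my_table WHERE updated_at > :sql_last_value"
    # instance A runs on the even 10-minute marks ...
    schedule => "0,10,20,30,40,50 * * * *"
    # ... while instance B uses "5,15,25,35,45,55 * * * *" so the runs don't overlap
  }
}

The offset schedules are what keep the instances from reading and writing the shared state file at the same time; whichever instance runs next simply picks up from the :sql_last_value the previous run left behind.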

milosh3411