I have a basic HA setup for Logstash - two identical nodes in two separate AWS availability zones. Each node runs a pipeline that extracts a dataset from a DB cluster and outputs it downstream to an Elasticsearch cluster for indexing. This works fine with a single Logstash node, but with two nodes running in parallel the same data gets sent to ES twice, because each node tracks :sql_last_value separately. Since I use the same ID as the document ID on both nodes, the repeated data is simply updated rather than inserted twice - in other words, one insert and one update per dataset. This is obviously not very efficient and puts unnecessary load on the ELK resources, and it only gets worse as additional Logstash nodes are added.
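A minimal sketch of what each node's pipeline looks like (connection details, query, and tracking column are placeholders, and the driver jar / credentials are omitted for brevity):

```
input {
  jdbc {
    # Placeholder connection details
    jdbc_connection_string => "jdbc:postgresql://db-cluster:5432/mydb"
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "*/5 * * * *"
    # Incremental extraction driven by :sql_last_value,
    # which each node tracks independently
    statement => "SELECT id, updated_at, payload FROM dataset WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["https://es-cluster:9200"]
    index => "dataset"
    # Same document ID on both nodes, so the second node's copy
    # turns into an update instead of a duplicate document
    document_id => "%{id}"
  }
}
```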
Does anyone know a better way to set up parallel Logstash nodes, so that a node doesn't extract a dataset that has already been extracted by another node? One poor man's solution could be to create a shared NFS folder between the Logstash nodes and have each node write :sql_last_value there (roughly the snippet below), but I am not sure what side effects I might run into with that setup, especially under higher load.
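What I have in mind is pointing last_run_metadata_path on both nodes at the same file on the NFS mount (the path is just an example):

```
input {
  jdbc {
    # ... same settings as above ...
    # Both nodes read and write the same last-run file on shared storage,
    # so they see the same :sql_last_value high-water mark
    last_run_metadata_path => "/mnt/logstash-shared/.dataset_last_run"
  }
}
```

The part I'm unsure about is what happens when both nodes' schedules fire at the same time and both read the old value before either has written the new one. Thank you!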