We're writing data to ElasticSearch from Spark Streaming in Java through the saveAsNewAPIHadoopFile method of JavaPairRDD (Spark 1.6.0). This works perfectly well both locally and on a cluster.

However, we notice the number of connections to ElasticSearch growing very quickly (as can be seen from http://localhost:9200/_nodes/stats/http/_all?pretty for a run on a local system), eventually causing ElasticSearch to become very slow. It seems that a new connection is set up for each RDD, and it appears to be closed again as well.

Is it possible to open a connection and keep it open as long as possible, or at least for a considerably long time? We are using Spark 1.6.0 as mentioned, and ElasticSearch 2.0.0.
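For reference, the write path described above looks roughly like the following sketch. The index/type name, field names, and the RDD contents are illustrative; the `es.*` keys are standard elasticsearch-hadoop settings, and this is a non-runnable configuration sketch (it needs a Spark context and a live ElasticSearch cluster):

```java
// Sketch only: illustrates the saveAsNewAPIHadoopFile write path from the
// question, using elasticsearch-hadoop's EsOutputFormat. "events/doc" is a
// hypothetical index/type; the path argument is ignored by EsOutputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class EsWriteSketch {
    public static void write(JavaPairRDD<NullWritable, Text> docs) {
        Configuration esConf = new Configuration();
        esConf.set("es.nodes", "localhost:9200"); // cluster to write to
        esConf.set("es.resource", "events/doc");  // target index/type (illustrative)
        esConf.set("es.input.json", "yes");       // values are already JSON strings

        // Each micro-batch RDD is written through the Hadoop OutputFormat,
        // which is where the per-RDD connection setup the question observes happens.
        docs.saveAsNewAPIHadoopFile(
                "-",                              // unused by EsOutputFormat
                NullWritable.class, Text.class,
                EsOutputFormat.class, esConf);
    }
}
```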
1 Answer
Yes: if you're creating a connection in your foreachRDD, a connection is created for each RDD. You should use connection pooling instead. This is detailed extensively in the doc:
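The pooling pattern the answer recommends can be sketched as a lazily initialized, per-JVM holder that is looked up inside foreachRDD/foreachPartition, so each executor reuses one client across micro-batches instead of opening a fresh connection per RDD. `EsClientHolder` and its nested `Client` are illustrative names, not part of any Spark or ElasticSearch API:

```java
// Hypothetical per-JVM client holder illustrating connection reuse.
// In a real job, Client would be an ElasticSearch TransportClient or
// node client; here it is a stub so the pattern stands on its own.
public class EsClientHolder {

    public static class Client {
        public void index(String jsonDoc) {
            // Stand-in for sending a document to the cluster.
        }
    }

    private static Client client;

    // Lazily create the client once per executor JVM; subsequent calls
    // (e.g. one per micro-batch or partition) reuse the same instance,
    // so connections no longer grow with the number of RDDs.
    public static synchronized Client get() {
        if (client == null) {
            client = new Client();
        }
        return client;
    }
}
```

Inside the streaming job, each partition would then call `EsClientHolder.get()` rather than constructing a new client, which keeps the connection open for the lifetime of the executor.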

Francois G
- Thanks, good point which I didn't really think much about, probably because saveAsNewAPIHadoopFile seems to take care of establishing a connection (and closing it as well). So, if I want to use a nodeBuilder to create a client (some kind of singleton, I guess), how do I use that client with this saveAsNewAPIHadoopFile method? – Martijn Kamstra Sep 15 '16 at 16:47
- Oh wait, I somewhat missed your second link. I'll try an approach like that and will let you know tomorrow if it worked. – Martijn Kamstra Sep 15 '16 at 16:49
- I somehow can't get this to work (I still don't see how saveAsNewAPIHadoopFile uses a node that is created, as it seems to do everything 'under the hood'), so I'm trying the TransportClient instead. Now I'm running into an exception which, according to http://stackoverflow.com/questions/33544863/java-elasticsearch-client-always-null, is related to conflicting Guava versions (probably at runtime, since everything compiles), but I haven't been able to solve that yet either. – Martijn Kamstra Sep 16 '16 at 13:27