We're writing data to ElasticSearch from Spark Streaming in Java through the saveAsNewAPIHadoopFile method of JavaPairRDD (Spark 1.6.0). This works perfectly well both locally and on a cluster.

However, we notice the number of connections to ElasticSearch growing very quickly (as can be seen from http://localhost:9200/_nodes/stats/http/_all?pretty for a run on a local system), eventually causing ElasticSearch to become very slow. It seems that a new connection is set up for each RDD, and it appears to be closed again as well.

Is it possible to open a connection and keep it open as long as possible, or at least for a considerably long time? We are using Spark 1.6.0 as mentioned, and ElasticSearch 2.0.0.
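For reference, the write path described above looks roughly like the following sketch. The index/type name, field names, and the RDD contents are illustrative; the `es.*` keys are standard elasticsearch-hadoop settings, and this is a non-runnable configuration sketch (it needs a Spark context and a live ElasticSearch cluster):

```java
// Sketch only: illustrates the saveAsNewAPIHadoopFile write path from the
// question, using elasticsearch-hadoop's EsOutputFormat. "events/doc" is a
// hypothetical index/type; the path argument is ignored by EsOutputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class EsWriteSketch {
    public static void write(JavaPairRDD<NullWritable, Text> docs) {
        Configuration esConf = new Configuration();
        esConf.set("es.nodes", "localhost:9200"); // cluster to write to
        esConf.set("es.resource", "events/doc");  // target index/type (illustrative)
        esConf.set("es.input.json", "yes");       // values are already JSON strings

        // Each micro-batch RDD is written through the Hadoop OutputFormat,
        // which is where the per-RDD connection setup the question observes happens.
        docs.saveAsNewAPIHadoopFile(
                "-",                              // unused by EsOutputFormat
                NullWritable.class, Text.class,
                EsOutputFormat.class, esConf);
    }
}
```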
1 Answer
Yes: if you're creating a connection in your foreachRDD, a connection is created for each RDD. You should use connection pooling instead. This is detailed extensively in the doc:
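The pooling pattern the answer recommends can be sketched as a lazily initialized, per-JVM holder that is looked up inside foreachRDD/foreachPartition, so each executor reuses one client across micro-batches instead of opening a fresh connection per RDD. `EsClientHolder` and its nested `Client` are illustrative names, not part of any Spark or ElasticSearch API:

```java
// Hypothetical per-JVM client holder illustrating connection reuse.
// In a real job, Client would be an ElasticSearch TransportClient or
// node client; here it is a stub so the pattern stands on its own.
public class EsClientHolder {

    public static class Client {
        public void index(String jsonDoc) {
            // Stand-in for sending a document to the cluster.
        }
    }

    private static Client client;

    // Lazily create the client once per executor JVM; subsequent calls
    // (e.g. one per micro-batch or partition) reuse the same instance,
    // so connections no longer grow with the number of RDDs.
    public static synchronized Client get() {
        if (client == null) {
            client = new Client();
        }
        return client;
    }
}
```

Inside the streaming job, each partition would then call `EsClientHolder.get()` rather than constructing a new client, which keeps the connection open for the lifetime of the executor.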

Francois G
- Thanks, good point which I didn't really think much about, probably because saveAsNewAPIHadoopFile seems to take care of establishing a connection (and closing it as well). So, if I want to use a nodeBuilder to create a client (some kind of singleton, I guess), how do I use that client with this saveAsNewAPIHadoopFile method? – Martijn Kamstra Sep 15 '16 at 16:47
- Oh wait, I somewhat missed your second link. I'll try an approach like that and will let you know tomorrow if it worked. – Martijn Kamstra Sep 15 '16 at 16:49
- I somehow can't get this to work (I still don't see how saveAsNewAPIHadoopFile uses a node that is created, as it seems to do everything 'under the hood'), so I'm trying the TransportClient instead. Now I'm running into an exception which, according to http://stackoverflow.com/questions/33544863/java-elasticsearch-client-always-null, is related to conflicting Guava versions (probably at runtime, since everything compiles), but I haven't been able to solve that yet either. – Martijn Kamstra Sep 16 '16 at 13:27