
I seem to have hit a problem where writing from Spark to Elasticsearch is very slow: the job takes around 15 minutes just to make the initial connection, during which both Spark and Elasticsearch sit idle. There is another thread in the Elastic community highlighting the same issue, but it was closed without a solution.

This is how I am writing from Spark to ES:

vgDF.write.format("org.elasticsearch.spark.sql") \
    .mode("append") \
    .option("es.resource", "demoindex/type1") \
    .option("es.nodes", "*ES IP*") \
    .save()
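For reference, ES-Hadoop exposes a few connection-related settings that can make this kind of stall fail fast instead of hanging; this is only a sketch, and the timeout value and single-node assumption are placeholders rather than anything from my actual job:

```python
# Sketch: the same write with ES-Hadoop's connection options made explicit.
# "es.http.timeout" and "es.nodes.discovery" are real ES-Hadoop settings;
# the values chosen here are illustrative only.
vgDF.write.format("org.elasticsearch.spark.sql") \
    .mode("append") \
    .option("es.resource", "demoindex/type1") \
    .option("es.nodes", "*ES IP*") \
    .option("es.http.timeout", "30s") \
    .option("es.nodes.discovery", "false") \
    .save()
# es.http.timeout caps how long an HTTP call may hang;
# es.nodes.discovery=false skips discovering other nodes (single-node cluster).
```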

Spark specifications:

Spark 2.1.0; 3 CPU x 10 GB RAM x 6 executors, running on 3 GCE nodes

Elasticsearch specifications:

8 CPU x 30 GB RAM, single node

ES Versions:

Elasticsearch: 6.2.2, ES-Hadoop: 6.2.2

For your information, Spark reads data from Cassandra, processes the results (this step is quite fast, around 1-2 minutes), and then writes to Elasticsearch.

Any help would be greatly appreciated.

[EDIT]

I have also tried varying the data size from millions of records down to just 960 records, but the initial delay stays the same (approximately 15 minutes).

waleed ali

1 Answer


It looks like the ES connection is timing out. Check whether ES is reachable at the IP address you are providing; if you are using the public IP, try switching to the private IP.
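A quick way to confirm which address is actually reachable from the Spark nodes before re-running the job; the host and port below are placeholders for your ES private IP, not values from your setup:

```python
import socket

def es_reachable(host, port=9200, timeout=5):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this from a Spark worker, e.g. es_reachable("10.128.0.5").
# If the public IP hangs or fails here, the ES-Hadoop connector
# will stall in exactly the same way.
```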

Junaid
  • Yeah, I changed the public IP to the private one and it drastically reduced the ingestion time, from 20 minutes down to 12 seconds. Thanks, mate! – waleed ali Apr 10 '18 at 06:27
  • Hi, I'm in the same situation but I don't know how to fix it. I'm currently running Docker containers, one for ES and one for Spark; they are in different projects (docker-compose), and Spark is able to reach ES. Any suggestion? – GianAnge Nov 22 '19 at 13:14