Questions tagged [elasticsearch-hadoop]

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

Requirements

An Elasticsearch cluster (0.9X series, or 1.0.0 or higher, which is highly recommended) accessible through REST. That's it! Significant effort has been invested to create a small, self-contained jar that can be downloaded and put to use without any dependencies. Simply make it available to your job classpath and you're set. For requirements specific to a particular library, see the dedicated chapter.
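As a rough illustration of "make it available to your job classpath", here is a minimal sketch that assembles a `spark-submit` command line that ships the connector jar and points it at a cluster. The helper name `spark_submit_args`, the jar filename, and the `job.py` application are hypothetical placeholders. The `spark.` prefix on the `es.*` settings is what Spark expects for properties passed through `--conf`; the rest is an assumption about one possible setup, not the project's prescribed method.

```python
import shlex


def spark_submit_args(jar_path, es_nodes, es_port=9200, app="job.py"):
    """Build a spark-submit argument list that ships the connector jar.

    jar_path and app are hypothetical placeholders for illustration.
    """
    # es-hadoop settings passed through spark-submit carry a "spark." prefix.
    settings = {
        "spark.es.nodes": es_nodes,
        "spark.es.port": str(es_port),
    }
    # --jars puts the connector jar on the driver and executor classpaths.
    args = ["spark-submit", "--jars", jar_path]
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Print the command as a single shell-quoted string.
    print(shlex.join(spark_submit_args("elasticsearch-hadoop.jar", "localhost")))
```

The function only builds the argument list, so it can be inspected or logged before anything is launched; running it for real would mean handing the list to something like `subprocess.run`.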

Documentation

109 questions
10
votes
1 answer

PySpark - Retain null values when using collect_list

According to the accepted answer in pyspark collect_set or collect_list with groupby, when you do a collect_list on a certain column, the null values in this column are removed. I have checked and this is true. But in my case, I need to keep the…
10
votes
2 answers

ElasticSearch to Spark RDD

I was testing ElasticSearch and Spark integration on my local machine, using some test data loaded in elasticsearch. val sparkConf = new SparkConf().setAppName("Test").setMaster("local") val sc = new SparkContext(sparkConf) val conf = new…
7
votes
1 answer

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark. Here is the code: JavaSparkContext sc = new JavaSparkContext( new SparkConf().setAppName("MySQLtoEs") .set("es.index.auto.create",…
eliasah
6
votes
1 answer

Deploy Elasticsearch for Apache Spark on Kubernetes

I'm wondering if anyone has experience configuring a Kubernetes cluster using the Elasticsearch for Hadoop library. I'm running into issues with the node discovery timing out when trying to write from spark to elasticsearch. I have Elasticsearch up…
6
votes
1 answer

How do you read and write from/into different ElasticSearch clusters using spark and elasticsearch-hadoop?

Original title: Besides HDFS, what other DFS does Spark support (and are recommended)? I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) with several gigantic clusters. From time to time, I would like to pull the entire…
5
votes
3 answers

Python spark Dataframe to Elasticsearch

I can't figure out how to write a dataframe to elasticsearch using python from spark. I followed the steps from here. Here is my code: # Read file df = sqlContext.read \ .format('com.databricks.spark.csv') \ .options(header='true') \ …
dimzak
5
votes
1 answer

Elasticsearch-Hadoop library cannot connect to Docker container

I have a Spark job that reads from Cassandra, processes/transforms/filters the data, and writes the results to Elasticsearch. I use Docker for my integration tests, and I am running into trouble writing from Spark to…
5
votes
1 answer

Fail to apply mapping on an RDD on multiple Spark nodes through Elasticsearch-hadoop library

import org.elasticsearch.spark._ import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.serializer._; import com.esotericsoftware.kryo.Kryo; import org.elasticsearch.spark.rdd.EsSpark sc.stop() val conf = new…
AmirHd
5
votes
3 answers

elasticsearch-spark connector size limit parameter is ignored in query

I'm trying to query Elasticsearch with the elasticsearch-spark connector and I want to return only a few results. For example: val conf = new SparkConf().set("es.nodes","localhost").set("es.index.auto.create", "true").setMaster("local") val…
Udy
4
votes
2 answers

Elasticsearch-hadoop & Elasticsearch-spark sql - Tracing of statements scan&scroll

We are trying to integrate ES (1.7.2, 4-node cluster) with Spark (1.5.1, compiled with Hive and Hadoop with Scala 2.11, 4-node cluster); HDFS also comes into the equation (Hadoop 2.7, 4 nodes), along with the Thrift JDBC server and…
4
votes
1 answer

What is ElasticSearch-Hadoop (es-hadoop) and its benefit over HBase for a live web application?

It is not entirely clear to me what es-hadoop is from the description. Is this merely a "connector" that will move data over from your ES cluster to HDFS for Hadoop analytics? If so, why not just go with HBase for low-latency text queries? Is…
ElHaix
3
votes
1 answer

Spark 2.4 to Elasticsearch : prevent data loss during Dataproc nodes decommissioning?

My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster. We use Apache Spark 2.4 with the Elastic Hadoop connector on a Google Dataproc cluster (autoscaling enabled). During the execution, if the…
3
votes
2 answers

what does load() do in spark?

spark is lazy right? so what does load() do? start = timeit.default_timer() df = sqlContext.read.option( "es.resource", indexes ).format("org.elasticsearch.spark.sql") end = timeit.default_timer() print('without load: ', end - start) #…
eugene
3
votes
1 answer

Spark writing to Elasticsearch slow performance

I seem to have hit a problem in which Spark writing to Elasticsearch is very slow and it takes quite a lot of time (around 15 mins) in making the initial connection, during which both Spark and Elasticsearch remain idle. There is another thread…
waleed ali
3
votes
1 answer

Elasticsearch-Hadoop: how to do a bulk search in a Spark program

I am writing a Spark program which is basically an RDD of Strings. What I need to do is create a query per string and run it against an Elasticsearch index, so the query would differ for each string. I wanted to use…