Questions tagged [elasticsearch-hadoop]

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

Requirements

An Elasticsearch cluster (0.9X series, or 1.0.0 or higher, which is highly recommended) accessible through REST. That's it! Significant effort has been invested to create a small, self-contained jar that can be downloaded and put to use without any additional dependencies. Simply make it available on your job classpath and you're set. For requirements specific to a particular library, see the dedicated chapter.

Documentation

109 questions

votes

1 answer

PySpark - Retain null values when using collect_list

According to the accepted answer in pyspark collect_set or collect_list with groupby, when you do a collect_list on a certain column, the null values in this column are removed. I have checked and this is true. But in my case, I need to keep the…

asked Mar 20 '18 at 22:54

activelearner

7,055
20
53
94

votes

2 answers

ElasticSearch to Spark RDD

I was testing ElasticSearch and Spark integration on my local machine, using some test data loaded in elasticsearch. val sparkConf = new SparkConf().setAppName("Test").setMaster("local") val sc = new SparkContext(sparkConf) val conf = new…

serialization elasticsearch apache-spark elasticsearch-hadoop

asked Aug 11 '14 at 21:58

user3931226

votes

1 answer

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark. Here is the code: JavaSparkContext sc = new JavaSparkContext( new SparkConf().setAppName("MySQLtoEs") .set("es.index.auto.create",…

elasticsearch apache-spark elasticsearch-hadoop apache-spark-1.5

asked Sep 19 '15 at 10:21

eliasah

39,588
11
124
154

votes

1 answer

Deploy Elasticsearch for Apache Spark on Kubernetes

I'm wondering if anyone has experience configuring a Kubernetes cluster using the Elasticsearch for Hadoop library. I'm running into issues with the node discovery timing out when trying to write from spark to elasticsearch. I have Elasticsearch up…

hadoop elasticsearch apache-spark kubernetes elasticsearch-hadoop

asked Oct 27 '16 at 19:35

Aaron Duke

votes

1 answer

How do you read and write from/into different ElasticSearch clusters using spark and elasticsearch-hadoop?

Original title: Besides HDFS, what other DFS does spark support (and are recommended)? I am happily using spark and elasticsearch (with elasticsearch-hadoop driver) with several gigantic clusters. From time to time, I would like to pull the entire…

apache-spark elasticsearch hdfs elasticsearch-hadoop distributed-filesystem

asked Mar 12 '15 at 01:02

Winston Chen

6,799
12
52
81

votes

3 answers

Python spark Dataframe to Elasticsearch

I can't figure out how to write a dataframe to elasticsearch using python from spark. I followed the steps from here. Here is my code: # Read file df = sqlContext.read \ .format('com.databricks.spark.csv') \ .options(header='true') \ …

elasticsearch apache-spark pyspark elasticsearch-hadoop

asked Sep 18 '16 at 15:05

dimzak

2,511
8
38
51

votes

1 answer

Elasticsearch-Hadoop library cannot connect to Docker container

I have a Spark job that reads from Cassandra, processes/transforms/filters the data, and writes the results to Elasticsearch. I use Docker for my integration tests, and I am running into trouble writing from Spark to…

scala elasticsearch apache-spark docker elasticsearch-hadoop

asked Aug 08 '16 at 19:15

Needs Help

votes

1 answer

Fail to apply mapping on an RDD on multiple Spark nodes through Elasticsearch-hadoop library

import org.elasticsearch.spark._ import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.serializer._; import com.esotericsoftware.kryo.Kryo; import org.elasticsearch.spark.rdd.EsSpark sc.stop() val conf = new…

scala elasticsearch apache-spark rdd elasticsearch-hadoop

asked May 03 '16 at 08:10

AmirHd

10,308
11
41
60

votes

3 answers

elasticsearch-spark connector size limit parameter is ignored in query

I'm trying to query elasticsearch with the elasticsearch-spark connector and I want to return only few results: For example: val conf = new SparkConf().set("es.nodes","localhost").set("es.index.auto.create", "true").setMaster("local") val…

scala elasticsearch apache-spark elasticsearch-hadoop

asked Aug 12 '15 at 14:37

Udy

2,492
4
23
33

votes

2 answers

Elasticsearch-hadoop & Elasticsearch-spark sql - Tracing of statements scan&scroll

We are trying to integrate ES (1.7.2, 4-node cluster) with Spark (1.5.1, compiled with Hive and Hadoop with Scala 2.11, 4-node cluster); HDFS also comes into the equation (Hadoop 2.7, 4 nodes), along with the Thrift JDBC server and…

elasticsearch apache-spark apache-spark-sql elasticsearch-hadoop

asked Nov 13 '15 at 07:27

alobal

votes

1 answer

What is ElasticSearch-Hadoop (es-hadoop) and its benefit over HBase for a live web application?

It is not entirely clear to me what es-hadoop is from the description. Is this merely a "connector" that will move data over from your ES cluster to HDFS for Hadoop analytics? If so, why not just go with HBase for low-latency text queries? Is…

hadoop elasticsearch hbase elasticsearch-hadoop

asked Jul 30 '15 at 14:23

ElHaix

12,846
27
115
203

votes

1 answer

Spark 2.4 to Elasticsearch : prevent data loss during Dataproc nodes decommissioning?

My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster. We use Apache Spark 2.4 with the Elastic Hadoop connector on a Google Dataproc cluster (autoscaling enabled). During the execution, if the…

apache-spark elasticsearch google-cloud-dataproc elasticsearch-hadoop

asked Jan 21 '20 at 10:31

Fred Rouvier

votes

2 answers

what does load() do in spark?

spark is lazy right? so what does load() do? start = timeit.default_timer() df = sqlContext.read.option( "es.resource", indexes ).format("org.elasticsearch.spark.sql") end = timeit.default_timer() print('without load: ', end - start) #…

apache-spark elasticsearch-hadoop

asked Jun 29 '19 at 15:14

eugene

39,839
68
255
489

votes

1 answer

Spark writing to Elasticsearch slow performance

I seem to have hit a problem in which Spark writing to Elasticsearch is very slow and it takes quite a lot of time (around 15 mins) in making the initial connection, during which both Spark and Elasticsearch remain idle. There is another thread…

apache-spark elasticsearch pyspark elasticsearch-hadoop

asked Mar 20 '18 at 19:02

waleed ali

1,175
10
23

votes

1 answer

Elasticsearch-Hadoop: how to do a bulk search in a Spark program

I am writing a Spark program which is basically an RDD of Strings. What I need to do is create a query per string and run it against an Elasticsearch index, so the query differs for each string. I wanted to use…

hadoop apache-spark elasticsearch elasticsearch-hadoop

asked Sep 07 '17 at 02:18

Saurabh Sharma
