
Beginner SparkR and ElasticSearch question here!

How do I write a SparkR DataFrame or RDD to ElasticSearch with multiple nodes?

There is a specific R package for Elasticsearch, `elastic`, but it says nothing about Hadoop or distributed dataframes. When I try to use it I get the following error:

install.packages("elastic", repos = "http://cran.us.r-project.org")
library(elastic)

# df is a SparkDataFrame (distributed), not a local data.frame
df <- read.json('/hadoop/file/location')

connect(es_port = 9200, es_host = 'https://hostname.dev.company.com', es_user = 'username', es_pwd = 'password')
docs_bulk(df)

Error: no 'docs_bulk' method for class SparkDataFrame

If this were PySpark, I would use the rdd.saveAsNewAPIHadoopFile() function as shown here, but I can't find any information about an equivalent in SparkR from googling. ElasticSearch also has good documentation, but only for Scala and Java.

I'm sure there is something obvious I am missing; any guidance appreciated!

whs2k
  • `elastic` maintainer here: that doesn't exist in the R client yet. Please open an issue at https://github.com/ropensci/elastic/issues/ to discuss; we can see whether https://cran.rstudio.com/web/packages/sparklyr/ is the place to discuss this or whether we can address it in the `elastic` pkg. The Elastic website docs are probably only for officially supported clients, of which R is not one. – sckott Mar 06 '18 at 23:27
  • Hello @sckott ! I suggest that you write your comment as an answer or a wiki answer. – eliasah Mar 07 '18 at 07:43
  • @sckott I just opened issue #213 on github.com/ropensci/elastic/issues – whs2k Mar 07 '18 at 14:45
  • @eliasah thanks, but i'm not sure what a wiki answer is? – sckott Mar 07 '18 at 18:09

1 Answer


To connect your SparkR session to Elasticsearch, you need to make the connector jar and your ES configuration available to your SparkR session.

1: Specify the jar (look up which version you need in the Elasticsearch documentation; the version below is for Spark 2.x, Scala 2.11 and ES 6.8.0):

sparkPackages <- "org.elasticsearch:elasticsearch-spark-20_2.11:6.8.0"

2: Specify your cluster config in your sparkConfig. You can add other Elasticsearch config here, too (and, of course, additional Spark configs); a sketch of some common extra options follows the snippet below:

sparkConfig <- list(es.nodes = "your_comma-separated_es_nodes",
                    es.port = "9200")

3: Initiate a SparkR session:

sparkR.session(master="your_spark_master", 
               sparkPackages=sparkPackages, 
               sparkConfig=sparkConfig)

4: Do some magic that results in a SparkDataFrame you want to save to ES.
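
For example, reading the question's JSON file (path kept from the question) into a SparkDataFrame:

yourSparkDF <- read.json('/hadoop/file/location')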

5: Write your dataframe to ES:

write.df(yourSparkDF, source = "org.elasticsearch.spark.sql",
         path = "your_ES_index_path")
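
Note that write.df defaults to mode "error", which fails if the target index path already exists; you can pass a different mode and extra elasticsearch-hadoop options through the same call. A sketch, assuming a hypothetical "id" column to use as the document _id:

write.df(yourSparkDF, source = "org.elasticsearch.spark.sql",
         path = "your_ES_index_path",
         mode = "append",         # append to an existing index instead of erroring
         es.mapping.id = "id")    # optional: use the "id" column as the document _id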
Janna Maas