
I'm wondering about the ways to connect a Spark application to Pivotal HD, a Hadoop distribution.

What is the best way to connect to it using Spark? For example, would a JDBC read like this work?

// Read a table over JDBC into a DataFrame (Spark 1.x SQLContext API)
val jdbcDataFrame = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tablename")).load()

1 Answer


I see your question has been edited, but I'll try to answer all of your queries.

Pivotal HD (formerly Greenplum HD) is a Hadoop distro, so you should use it like any other Hadoop/HDFS distro. Specifically, you can read data straight from its HDFS:

text_file = spark.textFile("hdfs://...")  # 'spark' here is the SparkContext
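
As a slightly fuller sketch in Scala (matching your question's snippet), reading from and writing back to the cluster's HDFS — the NameNode host, port, and paths below are placeholders for your Pivotal HD cluster:

// Hedged sketch; sc is the SparkContext, and host/port/paths are placeholders.
val events = sc.textFile("hdfs://namenode-host:8020/data/events")
val errorCounts = events
  .filter(_.contains("ERROR"))               // keep only error lines
  .map(line => (line.split("\t")(0), 1))     // key by the first tab-separated field
  .reduceByKey(_ + _)                        // count errors per key
errorCounts.saveAsTextFile("hdfs://namenode-host:8020/output/error-counts")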

Or for running jobs via YARN, see:

http://spark.apache.org/docs/latest/running-on-yarn.html
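
For example, a packaged job might be submitted to the cluster's YARN ResourceManager roughly like this — the class name, jar, and resource settings are illustrative only:

# Illustrative spark-submit invocation; adjust master mode, class, jar, and resources.
spark-submit \
  --class com.example.EventCounter \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 4g \
  event-counter.jar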

Greenplum DB (a distributed PostgreSQL database) does not back Pivotal HD, so the JDBC approach in your question isn't how you'd talk to plain Pivotal HD. The exception is if you're referring to Pivotal HAWQ, which is effectively Greenplum DB running on top of HDFS.
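
If HAWQ is what you have, the JDBC approach from your question should work with the standard PostgreSQL driver, since HAWQ speaks the Postgres wire protocol. A rough sketch — the host, port, database, and table names are placeholders, and the Postgres JDBC jar must be on Spark's classpath:

// Hypothetical HAWQ connection details; substitute your own.
val hawqDF = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://hawq-master:5432/analytics",
  "driver"  -> "org.postgresql.Driver",
  "dbtable" -> "schema.tablename"
)).load()
hawqDF.printSchema()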

Greenplum was a company that built Greenplum DB and Greenplum HD; it was acquired by EMC. EMC then grouped several businesses into the 'Pivotal Initiative', which rebranded Greenplum DB as 'Pivotal Greenplum Database' and Greenplum HD as 'Pivotal HD'.

Paul
  • This makes it a tough decision. I want to go with the best solution; on the other hand, I have TBs of structured and partitioned data that fits Greenplum perfectly. The problem is I need to process TBs of data at a time. I found myself almost reimplementing MapReduce, or at least its functionality, so that the data would fit in memory. It is undoubtedly going to be more performant to use MapReduce on the DB side, but is it not less performant to use something other than Greenplum for this structured data? – BAR Sep 11 '15 at 19:58
  • 1
    I figure this would best be left for another question: http://stackoverflow.com/questions/32531383/greenplum-pivotal-hd-spark-or-hawq-for-tbs-of-structured-data – BAR Sep 11 '15 at 20:09