
I am running a small Amazon EMR cluster and want to write to its Hive database from a remote connection via JDBC. I am running into an error that also appears when I execute everything locally on the EMR cluster itself, which is why I think the fault lies not with the remote connection but with something on EMR directly.

The error appears when running this minimal example:

from pyspark.sql import Row

# sqlContext is provided by the PySpark shell
connectionProperties = {
    "user": "aengelhardt",
    "password": "doot",
    "driver": "org.apache.hive.jdbc.HiveDriver"
}

test_df = sqlContext.createDataFrame([Row(name=1)])
test_df.write.jdbc(url="jdbc:hive2://127.0.0.1:10000", table="test_df",
                   properties=connectionProperties, mode="overwrite")

I then get a lot of Java error messages, but I think the important lines are these:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 940, in jdbc
    self.mode(mode)._jwrite.jdbc(url, table, jprop)
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o351.jdbc.
: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:23 cannot recognize input near '"name"' 'BIGINT' ')' in column name or primary key or foreign key

The last line hints that something went wrong while creating the table, since it tries to specify the 'name' column as a 'BIGINT' there.
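If I reconstruct the CREATE TABLE statement from the ParseException, my guess (unverified) is that Spark's generic JDBC dialect quotes column names with double quotes, while HiveQL only accepts backticks as identifier quotes. A sketch of what I suspect is happening, with both statements written out:

# Reconstruction (my assumption) of the DDL the JDBC writer sends;
# the quoted tokens match the ones the ParseException complains about:
spark_generated_ddl = 'CREATE TABLE test_df ("name" BIGINT)'

# HiveQL quotes identifiers with backticks instead, so the statement
# HiveServer2 would actually parse is:
hive_legal_ddl = 'CREATE TABLE test_df (`name` BIGINT)'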

I found this question, which has a similar problem; there the issue was that the SQL query was specified incorrectly. But here I don't specify a query at all, so I don't know where that happened or how to fix it.

As of now, I have no idea how to dig deeper to find the cause of this. Does anyone have a solution, or an idea of how to investigate further?

Alexander Engelhardt
  • What makes you think Hive is a "database" that supports transactions, rollbacks, atomic inserts/updates/deletes? OK, there has been some recent work to introduce **limited** support for all that, but are you sure your EMR cluster can do that -- and has been tweaked for that? – Samson Scharfrichter May 05 '18 at 20:15
  • What makes you think the Spark JDBC connector supports the Hive SQL dialect? The regular way to access Hive data from Spark is to access the Metastore and the HDFS data files directly, via SparkSQL. If you want to delegate query execution to HiveServer2 (and MapReduce) and use a single network link to transfer data with JDBC, then why use Spark at all? – Samson Scharfrichter May 05 '18 at 20:25
  • Between the RDD way and the Spark SQL way, I'm indifferent, but I chose the RDD's .write method because it was the first one where a possible solution popped up. If I switched to Spark SQL, I'd need to set the Hive Metastore to some external database - do you know how to do that? Googling for 'Spark SQL external metastore' didn't help much, unfortunately. – Alexander Engelhardt May 06 '18 at 08:14
  • https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql or https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/spark-config-hive.html _(since that's one of the basics when setting up Spark as part of a Hadoop distro)_ – Samson Scharfrichter May 06 '18 at 09:49
  • FYI the relevant Google search was _`spark hive-site.xml metastore.uris`_ -- it's way easier when you know what you search for... – Samson Scharfrichter May 06 '18 at 09:52
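Update, following the comments: if the recommended route is to let Spark talk to the Hive metastore directly via SparkSQL instead of going through HiveServer2 over JDBC, a minimal sketch of what I understand that to mean is below. This is untested on my side; the thrift URI is a placeholder, and on EMR the real value should come from /etc/hive/conf/hive-site.xml, in which case the .config(...) line can be dropped entirely:

from pyspark.sql import Row, SparkSession

# Talk to the Hive metastore directly instead of the HiveServer2 JDBC endpoint.
# The metastore URI below is a placeholder; with hive-site.xml on the
# classpath, enableHiveSupport() alone should pick it up.
spark = (SparkSession.builder
         .appName("hive-write-test")
         .config("hive.metastore.uris", "thrift://localhost:9083")
         .enableHiveSupport()
         .getOrCreate())

test_df = spark.createDataFrame([Row(name=1)])
test_df.write.mode("overwrite").saveAsTable("test_df")

Since saveAsTable goes through the metastore and writes the data files directly, no JDBC connection to HiveServer2 (and no double-quoted DDL) should be involved.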

0 Answers