I am running a small Amazon EMR cluster and want to write to its Hive database from a remote connection via JDBC. I am running into an error that also appears when I execute everything locally on the EMR cluster itself, which is why I think the fault is not the remote connection but something directly on EMR.
The error appears when running this minimal example:
connectionProperties = {
    "user": "aengelhardt",
    "password": "doot",
    "driver": "org.apache.hive.jdbc.HiveDriver"
}

from pyspark.sql import Row

test_df = sqlContext.createDataFrame([
    Row(name=1)
])

test_df.write.jdbc(url="jdbc:hive2://127.0.0.1:10000", table="test_df",
                   properties=connectionProperties, mode="overwrite")
I then get a lot of Java error messages, but I think the important lines are these:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 940, in jdbc
self.mode(mode)._jwrite.jdbc(url, table, jprop)
File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o351.jdbc.
: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:23 cannot recognize input near '"name"' 'BIGINT' ')' in column name or primary key or foreign key
The last line hints that something went wrong while creating the table, since Spark apparently tries to specify the 'name' column as a 'BIGINT' there.
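If that is what happens, the generated DDL presumably contains a double-quoted column name, which HiveQL does not accept (Hive quotes identifiers with backticks). One way to check that suspicion outside of Spark would be to send the statement to HiveServer2 directly. Here is a sketch using the third-party pyhive package; the package, the table names, and the exact DDL are my assumptions, not something I have confirmed Spark generates:

from pyhive import hive

# Assumes `pip install "pyhive[hive]"` and that HiveServer2 listens on
# 127.0.0.1:10000, as in the JDBC URL above.
conn = hive.connect(host="127.0.0.1", port=10000, username="aengelhardt")
cursor = conn.cursor()

# Double-quoted identifier, as the ParseException suggests Spark emits:
try:
    cursor.execute('CREATE TABLE test_df_doublequotes ("name" BIGINT)')
except Exception as e:
    print("double quotes rejected:", e)

# Backtick-quoted identifier, which HiveQL accepts:
cursor.execute('CREATE TABLE test_df_backticks (`name` BIGINT)')

If the first statement fails with the same ParseException while the second succeeds, that would confirm the quoting is the problem.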
I found this question, which describes a similar problem; there the issue was that the SQL query itself was specified incorrectly. But here I don't write any SQL myself, so I don't know where the bad statement comes from or how to fix it.
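The only SQL in play should be whatever Spark's JDBC writer generates on its own. As far as I understand, Spark picks a JdbcDialect based on the URL, and that dialect decides how column names are quoted in the generated CREATE TABLE statement. Here is a sketch of how one might inspect that from PySpark; it goes through the py4j JVM gateway, so the internals may differ between Spark versions:

# Ask Spark which JDBC dialect it uses for the hive2 URL and how that
# dialect quotes column names.
jvm = sqlContext._sc._jvm
dialect = jvm.org.apache.spark.sql.jdbc.JdbcDialects.get(
    "jdbc:hive2://127.0.0.1:10000")
print(dialect.getClass().getName())
print(dialect.quoteIdentifier("name"))  # double quotes here would match
                                        # the ParseException above

If the fallback dialect quotes identifiers with double quotes for a jdbc:hive2 URL, that would explain the error, but I haven't verified this.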
As of now, I have no idea how to dig deeper to find the cause. Does anyone have a solution, or an idea of how to search further for it?