
I want to update some lines in my Hive table. Since PySpark doesn't recognize UPDATE at all, I went for DELETE and INSERT instead, but I am getting "Operation not allowed" on the DELETE operation.

To solve this I stored the table as ORC and tried to meet the rest of the requirements listed on this page: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Limitations

I also set the table property "transactional"="true". Below is the code I used to set the properties:

sqlCtx.sql("""SET spark.hadoop.hive.support.concurrency=true""")
sqlCtx.sql("""SET spark.hadoop.hive.enforce.bucketing=true""")
sqlCtx.sql("""SET spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict""")
sqlCtx.sql("""SET spark.hadoop.hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager""")
sqlCtx.sql("""SET spark.hadoop.hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager""")
sqlCtx.sql("""SET spark.hadoop.hive.compactor.initiator.on=true""")
sqlCtx.sql("""SET spark.hadoop.hive.compactor.worker.threads=1""")

# Some other stuff happens creating the values etc
# Then I create the table from another table as orc

sqlCtx.sql("CREATE TABLE " + name + " AS SELECT * FROM new_base AS orc")
sqlCtx.sql("ALTER TABLE " + name + """ SET TBLPROPERTIES("transactional"="true")""")

# This will now result in Operation not allowed

sqlCtx.sql("DELETE FROM " + name) # I didn't keep the Where clause as it makes no difference so the error is not in the missing Where clause

I expected the DELETE statement to do something, or at least to throw an error because of the missing WHERE clause, but all I get is pyspark.sql.utils.ParseException: '\nOperation not allowed: DELETE FROM ...

If a complete code example including the creation of the table would be more helpful, say so in the comments and I will add it; I left it out for better readability. I should also note that this runs completely locally.

Felix

1 Answer


I think you should add the corresponding Hive configurations in the hive-site.xml file.

Also, the Hive transactions feature will only work if you are running a separate Hive server and firing your queries against that server. The Hive inside Spark is just an embedded metastore (used to store metadata for processing done by Spark). Since the embedded metastore comes without a Hive server, transactions won't work there.
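
As an illustration, here is a minimal sketch of firing the DELETE directly against a running HiveServer2. This is my own sketch, not taken from the question: it uses the PyHive client, and the host, port and table name are placeholders.

from pyhive import hive  # assumed client library; install with: pip install "pyhive[hive]"

# Connect to a running HiveServer2 instance (10000 is the default port).
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HiveServer2 picks up the ACID settings from hive-site.xml, so a DELETE on a
# transactional ORC table is accepted here, unlike through sqlCtx.sql in Spark.
cursor.execute("DELETE FROM my_table WHERE id = 1")  # "my_table" is a placeholder

cursor.close()
conn.close()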

To use transactions you need to install Hive, set those properties in hive-site.xml, and then connect from Spark to the Hive server URL. For reference, see https://github.com/Gowthamsb12/Spark/blob/master/Spark_ACID and the question "How to access the HIVE ACID table in Spark sql?".
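
A minimal sketch of that connection from PySpark, assuming a Hive metastore service is already running and reachable at thrift://localhost:9083 (adjust the URI to your setup):

from pyspark.sql import SparkSession

# Point Spark at the external Hive metastore service instead of the embedded,
# Derby-backed one that shows up locally as the spark-warehouse folder.
spark = (
    SparkSession.builder
    .appName("hive-acid-check")
    .config("hive.metastore.uris", "thrift://localhost:9083")  # placeholder URI
    .enableHiveSupport()
    .getOrCreate()
)

# Spark now sees the tables managed by the real Hive installation.
spark.sql("SHOW TABLES").show()

Whether the DELETE itself can then be issued through spark.sql or still has to go through HiveServer2 (as sketched above) depends on your Spark and Hive versions.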

For example, the corresponding Hive property for spark.hadoop.hive.support.concurrency=true is:

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
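
The remaining settings from the question map the same way once the spark.hadoop. prefix is dropped; as a sketch, the corresponding hive-site.xml entries (values taken from the question) would be:

<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.lock.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
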
jaimin03
  • With my current setup the code automatically creates a local folder called spark-warehouse containing a folder with the name of the table, which in turn contains two files, one with a .crc extension and one without. It feels like a real Hive server imo, but I am now trying to set up a separate Hive server to test. – Felix Jul 26 '19 at 09:36
  • Did creating the Hive server work? If you find the answer helpful then please accept it. – jaimin03 Jul 30 '19 at 06:27
  • I am still trying to solve the problem. I installed Hadoop following this guide: https://exitcondition.com/install-hadoop-windows/ and Hive as described in this guide: https://kontext.tech/docs/DataAndBusinessIntelligence/p/apache-hive-300-installation-on-windows-10-step-by-step-guide and I am still unable to get it to work. Also, trying to run the code from your link gives me an error. – Felix Jul 30 '19 at 06:47