35

I am using CDH5.5

I have a table created in Hive's default database and I am able to query it from the Hive CLI.

Output

hive> use default;
OK
Time taken: 0.582 seconds

hive> show tables;
OK
bank
Time taken: 0.341 seconds, Fetched: 1 row(s)

hive> select count(*) from bank;
OK
542
Time taken: 64.961 seconds, Fetched: 1 row(s)

However, I am unable to query the table from pyspark, which does not recognize it.

from pyspark.context import SparkContext
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

sqlContext.sql("use default")
DataFrame[result: string]

sqlContext.sql("show tables").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+

sqlContext.sql("FROM bank SELECT count(*)")

16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
      return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
    File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",   line 538, in __call__
    File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
      raise AnalysisException(s.split(': ', 1)[1])
  **pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5**

New Error

>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
    return DataFrame(self._ssql_ctx.table(tableName), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Thanks.

asked by Chn

5 Answers

51

You cannot pass the Hive table name directly to the HiveContext sql method, because it does not recognize the Hive table name on its own. One way to read a Hive table in the pyspark shell is:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()

To run SQL on the Hive table: first register the DataFrame obtained from the table as a temporary table, then run the SQL query against it.

bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()
answered by bijay697
  • bank = hive_context.table("bank") Traceback (most recent call last): File "", line 1, in File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table return DataFrame(self._ssql_ctx.table(tableName), self)File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco return f(*a, **kw) File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o30.table. – Chn Mar 17 '16 at 04:55
  • I have edited the answer to include the database name. It should work now. – bijay697 Mar 17 '16 at 19:31
  • hi Bijay697, I am getting the error org.apache.spark.sql.catalyst.analysis.NoSuchTableException. I updated the error in my original post (under "New Error"). Is there any special config needed to access the Hive metastore? – Chn Mar 23 '16 at 01:41
  • The error message means the table is not present in Hive. Can you try creating the table in a database other than default? Also, if you are submitting your job in cluster mode you may need to pass hive-site.xml. – bijay697 Apr 25 '16 at 02:52
  • What is 'sc' in your example above? – Sledge Apr 18 '17 at 16:47
  • @Sledge that is the SparkContext, a default variable in the session – Jan Sila Jul 27 '17 at 10:01
  • when the sql query is multiline, I get `ParseException: u'\nmismatched input \'nodeinv\' expecting (line 1, pos 5)` – Pyd Jul 08 '19 at 10:05
  • @bijay697 I tried executing the above query in pyspark but it failed with below given error. Can you please help pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':" – Jon Andrews Apr 07 '20 at 11:14
15

Spark SQL ships with its own metastore (Derby), so it can work even if Hive is not installed on the system. This is the default mode.

In the question above, you created the table in Hive. You get the "table not found" error because Spark SQL is using its default metastore, which has no metadata for your Hive table.

If you want Spark SQL to use the Hive metastore instead and access Hive tables, you have to add hive-site.xml to the Spark conf folder.
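
For reference, a minimal sketch of what that looks like (the hive-site.xml path below is the usual CDH location and an assumption; inside the pyspark shell you would reuse the existing sc instead of creating a new one):

# Copy the Hive client config into Spark's conf directory first, e.g.:
#   cp /etc/hive/conf/hive-site.xml /usr/lib/spark/conf/
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-metastore-check")  # skip this in the pyspark shell
hive_context = HiveContext(sc)

# With the Hive metastore configured, the table created from the Hive CLI should now appear.
hive_context.sql("show tables in default").show()
print(hive_context.table("default.bank").count())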

answered by sujit
  • What do you mean by "spark conf folder"? Anything specific to mention if pyspark is run in a Zeppelin application? – Julian Oct 06 '20 at 15:40
4

The solution to my problem was to copy hive-site.xml into $SPARK_HOME/conf and copy mysql-connector-java-*.jar into $SPARK_HOME/jars.
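
Spelled out, with example paths (adjust to your layout; the MySQL connector jar is only needed when the Hive metastore is backed by MySQL):

# Copy steps from this answer (example source paths, adjust as needed):
#   cp /etc/hive/conf/hive-site.xml  $SPARK_HOME/conf/
#   cp /usr/share/java/mysql-connector-java-*.jar  $SPARK_HOME/jars/
# Then verify from the pyspark shell (sc is the shell's SparkContext):
from pyspark.sql import HiveContext
HiveContext(sc).sql("show databases").show()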

answered by Huang
  • Even this worked for me. But I need to know how I can do it programmatically, without copying hive-site.xml to Spark's conf directory – nayak0765 Aug 12 '23 at 15:36
1

This is how I initialised sc to get the Hive table records, and not just its metadata:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf=conf)
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
data = hive_context.table("database_name.table_name")
data.registerTempTable("temp_table_name")
hive_context.sql("select * from temp_table_name limit 10").show()
answered by Azhar Khan
-8

You can use sqlCtx.sql. hive-site.xml should be copied to the Spark conf path.

my_dataframe = sqlCtx.sql("Select * from categories")
my_dataframe.show()
answered by Sagar Shah