
I'm using Apache Spark 2.1.1 and I have put the following hive-site.xml in the $SPARK_HOME/conf folder:

<?xml version="1.0"?>
<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql_server:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
  <description>Disable metastore schema version verification</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>${test.tmp.dir}/hadoop-tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://hadoop_namenode:9000/value_iq/hive_warehouse/</value>
  <description>Warehouse Location</description>
</property>
</configuration>

When I start the thrift server, the metastore schema is created in my MySQL DB, but it is not used; Derby is used instead.

I could not find any error in the thrift server log file. The only thing that catches my attention is that it attempts to use MySQL at first (INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL) but then, without any error, uses Derby instead (INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY). This is the thrift server log: https://www.dropbox.com/s/rxfwgjm9bdccaju/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-s-master.value-iq.com.out?dl=0

I have no Hive installed on my system; I intend to use the built-in Hive of Apache Spark.

I'm using mysql-connector-java-5.1.23-bin.jar, which is located in the $SPARK_HOME/jars folder.
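One quick way to check which metastore backend a Spark SQL session actually picks up is to run a query from an empty working directory and look for a local Derby directory afterwards. This is only a rough sketch; the working directory path is an assumption, not from the original setup:

```shell
# Hedged sketch: checks whether Spark silently fell back to a local Derby metastore.
# spark-sql reads the same $SPARK_HOME/conf/hive-site.xml as the thrift server.
mkdir -p /tmp/metastore-check && cd /tmp/metastore-check
$SPARK_HOME/bin/spark-sql -e "SHOW DATABASES;"

# Derby creates a metastore_db/ directory in the current working directory;
# if MySQL were really being used, this directory should not appear.
ls -d metastore_db 2>/dev/null && echo "fell back to Derby"
```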

  • Set the value of hive.metastore.schema.verification to false in hive-site.xml in both the Hive and Spark conf, restart the services, and try again – Noman Khan Jul 29 '17 at 18:08
  • I have set it and get the same behavior. When you say in both Hive and Spark, do you mean `$SPARK_HOME/conf/hive-site.xml` and `$SPARK_HOME/conf/spark-defaults.conf`? Remember I don't have Hive installed; I'm using Spark's built-in Hive. – José Carlos Guevara Turruelles Jul 29 '17 at 20:31

1 Answer


As it appears in your hive-site.xml, you have not set a metastore service to connect to, so Spark uses the default one: a local metastore service with a Derby DB backend.
In order to use a metastore service that has MySQL as its backend, you have to:

  1. Start the metastore service. You can see how to start the service in the Hive metastore admin manual. Start the metastore service with the MySQL DB backend, using your same hive-site.xml, and add the following lines to start the metastore service on METASTORESERVER on port XXXX:

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://METASTORESERVER:XXXX</value>
    </property>
    
  2. Let Spark know where the metastore service has started. That can be done using the same hive-site.xml you used when starting the metastore service (with the lines above added to it): copy this file into the configuration path of Spark, then restart your Spark thrift server.
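The two steps above might look like this on the command line. The host, port 9083, and paths are assumptions for illustration, not values from the original post:

```shell
# 1. On the metastore host: hive-site.xml (with the MySQL JDBC properties)
#    is in Hive's conf dir and the MySQL connector JAR is on Hive's classpath.
#    Start a standalone metastore service on port 9083:
hive --service metastore -p 9083 &

# 2. On the Spark side: add the hive.metastore.uris property shown above to
#    the same hive-site.xml, copy it into Spark's conf dir, and restart the
#    thrift server so it connects to the remote metastore instead of Derby:
cp hive-site.xml $SPARK_HOME/conf/
$SPARK_HOME/sbin/stop-thriftserver.sh
$SPARK_HOME/sbin/start-thriftserver.sh
```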

  • But I don't have Hive installed, I just have Apache Spark. So if I start the thrift server located in the $SPARK_HOME/sbin folder, it will use the only hive-site.xml available, in $SPARK_HOME/conf. Or is there something I'm still not understanding? – José Carlos Guevara Turruelles Jul 31 '17 at 11:37
  • Unfortunately, I do not know another way to start the metastore service without using Hive. Your understanding is good, but the idea is that the Spark thrift server is built upon the Hive server, using the metastore service, which plays the role of intermediary between the Hive/Spark server and the backend database – user1314742 Jul 31 '17 at 11:45
  • To start a metastore using Hive, you just need to download a version of Hive, unzip it, put your hive-site.xml in Hive's configuration directory, then start its metastore server. No extra configuration is needed. – user1314742 Jul 31 '17 at 11:51
  • Thanks for your answer, I have rewarded it. One thing: if I use Hive, do I need to install it on every node of my cluster, or is it enough to have it on one machine? – José Carlos Guevara Turruelles Jul 31 '17 at 12:18
  • If you want to use Hive just to start the metastore, then you install it on one node in the cluster; no need to install it on all nodes. Thanks for the bounty; please, when you feel this answer is the solution to the problem, accept it so it will be marked as answered ;) – user1314742 Jul 31 '17 at 12:23
  • The problem is that I want to use the thrift server provided by Spark, to query it through JDBC, but if I start the Hive thrift server then the queries will run on Hive, not using the Spark engine. What can I do? – José Carlos Guevara Turruelles Jul 31 '17 at 12:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/150613/discussion-between-user1314742-and-jose). – user1314742 Jul 31 '17 at 12:33
  • I have found the way to solve the issue. As you correctly suggested, I installed Hive to use the metastore service. The problem was that I had installed Hive 2.1, and reading some Spark documentation I learned Spark is built with Hive 1.2, so I installed that version and then it works. Somehow when I start the thrift server it still logs `Using direct SQL, underlying DB is DERBY`, but it connects to the Hive metastore URI too. I have added to the question that I wasn't using Hive, so please add to your answer that I should install Hive version 1.2 and I'll accept it. Thanks a lot. – José Carlos Guevara Turruelles Aug 01 '17 at 16:03
  • Good catch! I forgot to say that. If you want to dive deeper into the dependencies and their versions, have a look at https://github.com/apache/spark/blob/v2.1.0/pom.xml#L133 – user1314742 Aug 01 '17 at 16:10
  • I actually found that the metastore service was not needed. After originally following these directions and getting it working, I then just removed `hive.metastore.uris`, shut down the Hive metastore service, and had Spark connect directly to MySQL; that seemed to work exactly the same, but without the Hive requirement at all. Is there any downside to doing it this way? It seems to be what the OP tried to do, but I'm not sure why it didn't work out then. – user1084563 Aug 07 '18 at 18:36
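For reference, the resolution described in the comments above (installing a Hive release matching Spark's built-in version, 1.2.x for Spark 2.1.x, on a single node just to run the metastore service) can be sketched as follows. The exact patch version 1.2.2 and the download URL are illustrative assumptions:

```shell
# Download and unpack a Hive 1.2.x release on one node only.
wget https://archive.apache.org/dist/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
tar -xzf apache-hive-1.2.2-bin.tar.gz

# Reuse the same hive-site.xml (MySQL JDBC settings plus hive.metastore.uris),
# then start only the metastore service; no other Hive component is needed.
cp hive-site.xml apache-hive-1.2.2-bin/conf/
apache-hive-1.2.2-bin/bin/hive --service metastore &
```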