9

I have installed Hadoop 2.8.1 on Ubuntu and then installed spark-2.2.0-bin-hadoop2.7 on top of it. I used spark-shell and created tables, and then used beeline and created tables as well. I have observed that three different folders named spark-warehouse got created:

1- spark-2.2.0-bin-hadoop2.7/spark-warehouse

2- spark-2.2.0-bin-hadoop2.7/bin/spark-warehouse

3- spark-2.2.0-bin-hadoop2.7/sbin/spark-warehouse

What exactly is spark-warehouse and why is it created in multiple places? Sometimes my spark-shell and beeline show different databases and tables, and sometimes they show the same ones. I don't understand what is happening.

Further, I did not install Hive, but I am still able to use beeline, and I can also access the databases through a Java program. How did Hive get onto my machine? Please help me; I am new to Spark and installed it by following online tutorials.

Below is the Java code I was using to connect to Apache Spark through JDBC:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SparkJdbcClient { // wrapping class added so the snippet compiles; the name is illustrative
        private static String driverName = "org.apache.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException {
            // Register the Hive JDBC driver
            try {
                Class.forName(driverName);
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
                System.exit(1);
            }
            // Connect to the HiveServer2/Thrift endpoint and create a statement
            Connection con = DriverManager.getConnection("jdbc:hive2://10.171.0.117:10000/default", "", "");
            Statement stmt = con.createStatement();
            // ... rest of the query code omitted in the original post
        }
    }
ABC
  • Possible duplicate of [How to connect to remote hive server from spark](https://stackoverflow.com/questions/39997224/how-to-connect-to-remote-hive-server-from-spark) – OneCricketeer Aug 28 '17 at 09:17
  • Suggestion: Use a fully setup Hadoop installation environment like Hortonworks Sandbox or Cloudera Quickstart for HDFS+YARN+Hive+Spark – OneCricketeer Aug 28 '17 at 09:18

3 Answers

11

What exactly is spark-warehouse and why is it created in multiple places?

Unless configured otherwise, Spark will create an internal Derby database named metastore_db, along with a derby.log file. It looks like you haven't changed that.

This is the default behavior, as pointed out in the documentation:

When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started

Sometimes my spark-shell and beeline show different databases and tables, and sometimes they show the same ones

You're starting those commands from different folders, so what you see is confined to the current working directory of each launch.
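
For example (the paths here just mirror the ones from the question), launching spark-shell from two different directories produces two separate warehouse folders:

    # each launch directory gets its own metastore_db/, derby.log and spark-warehouse/
    cd /home/user/Desktop/spark-2.2.0-bin-hadoop2.7 && ./bin/spark-shell      # creates spark-warehouse/ here
    cd /home/user/Desktop/spark-2.2.0-bin-hadoop2.7/bin && ./spark-shell      # creates spark-warehouse/ inside bin/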

I used beeline and created tables... How did Hive get onto my machine?

It didn't. You're probably connecting either to the Spark Thrift Server, which is fully compatible with the HiveServer2 protocol, or to the Derby database mentioned above, or you actually do have a HiveServer2 instance sitting at 10.171.0.117.

In any case, the JDBC connection is not required here. You can use the SparkSession.sql function directly.
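
A minimal sketch in Java (mirroring the snippet from the comments below; the warehouse path is illustrative and should point at the directory that actually holds your tables — note the three slashes in the file:/// URI):

    import org.apache.spark.sql.SparkSession;

    public class SparkSqlExample {
        public static void main(String[] args) {
            // Build a local SparkSession with Hive support instead of going through JDBC.
            SparkSession spark = SparkSession.builder()
                    .appName("Java Spark Hive Example")
                    .master("local")
                    .config("spark.sql.warehouse.dir",
                            "file:///home/user/Desktop/spark-2.2.0-bin-hadoop2.7/spark-warehouse")
                    .enableHiveSupport()
                    .getOrCreate();

            // Query the catalog directly with Spark SQL
            spark.sql("show databases").show();
            spark.sql("show tables").show();

            spark.stop();
        }
    }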

OneCricketeer
  • I tried using the below codes w/o jdbc but it does not show me the tables which are there in the db. Can you please provide me details here on (https://stackoverflow.com/questions/45833210/can-beeline-and-spark-shell-show-different-databases-for-same-apache-spark) The below java code shows 0 tables available.. – ABC Aug 28 '17 at 10:32
  • SparkSession spark = SparkSession .builder() .appName("Java Spark Hive Example") .master("local") .config("spark.sql.warehouse.dir", "file://home/user/Desktop/spark-2.2.0-bin-hadoop2.7/spark-warehouse") .enableHiveSupport() .getOrCreate(); spark.sql("Show tables").show(); – ABC Aug 28 '17 at 10:36
  • Well what does `cd $SPARK_HOME/spark-warehouse && beeline -u jdbc:hive2://10.171.0.117:10000/default -e "show tables"` give you? – OneCricketeer Aug 28 '17 at 10:43
  • The spark-warehouse folder shows three databases, i.e. db1, db2, and default + tab1; the same is shown at beeline by "show tables". I mean show databases and show tables. I am not able to execute the above command. – ABC Aug 28 '17 at 10:58
  • How about `spark.sql("Show databases")`? – OneCricketeer Aug 28 '17 at 11:00
  • The above Java code shows only the 'default' database, while through spark-shell it shows all three: db1, db2, default. Can we continue over personal email if you feel comfortable? (b.ajinkya1@gmail.com) – ABC Aug 28 '17 at 11:06
  • Where did you launch spark-shell? Apparently not within `/home/user/Desktop/spark-2.2.0-bin-hadoop2.7/` – OneCricketeer Aug 28 '17 at 11:07
  • Yes, I launched it from the above location..Is there any problem due to this? Why doesn't it show in java program? – ABC Aug 28 '17 at 11:10
  • Not sure. Start by clearing out all the remnant warehouse directories. Then run the Java program to see where it makes it, and then get `spark-sql` command to connect to that – OneCricketeer Aug 28 '17 at 11:15
  • Okay, thanks for your time. I will try. In the answer you said that I have Derby configured. Does this mean that I'm no longer able to process big data? – ABC Aug 28 '17 at 11:19
  • You don't have Hive or Hadoop configured. The default Spark package is using a Derby database, not a remote metastore like Postgres or MySQL... That is a whole separate problem. Yes, you can process "big data", but anything that fits in memory or on your single computer is not "big" in today's sense – OneCricketeer Aug 28 '17 at 11:22
  • Anyways, if you carefully read the documentation, you'll see how Hive integration actually works, and why folders are created where they are... http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables – OneCricketeer Aug 28 '17 at 11:27
  • Looking at your code, I think the issue is should be `file:///home` (3 slashes). You missed the root folder – OneCricketeer Aug 28 '17 at 11:33
  • It does not seem easy to get. I thought that since I installed Hadoop before Spark, it would do the job. Since I am wrong, I will need to start from the beginning. I will try a fully set-up environment as you advise. – ABC Aug 28 '17 at 11:36
  • Using three slashes also gives one database, but not three. – ABC Aug 28 '17 at 11:38
  • Thanks. My goal is to just read the data from Apache spark through java and use it for my purpose by storing it somewhere or use spark sql on this data. – ABC Aug 28 '17 at 11:45
  • https://hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-java/ – OneCricketeer Aug 28 '17 at 11:48
1

In standalone mode, Spark will create the metastore in the directory from where it was launched. This is explained here: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

So you should set spark.sql.warehouse.dir, or simply make sure you always start your Spark job from the same directory (run bin/spark-shell instead of cd bin; ./spark-shell, etc.).
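
For instance, a sketch of pinning the warehouse location in conf/spark-defaults.conf (the value is illustrative; any fixed absolute path works):

    # conf/spark-defaults.conf -- illustrative fixed location for the warehouse
    spark.sql.warehouse.dir    file:///home/user/Desktop/spark-2.2.0-bin-hadoop2.7/spark-warehouse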

FurryMachine
0

Here are my two cents: if you are using Hive to execute SQL from the command line, the spark-warehouse directory is also created in the launch directory.

In this situation, you need to specify hive.metastore.warehouse.dir under $HIVE_HOME/conf/hive-site.xml.

Relaunch the Hive metastore service; once the Hive warehouse location is changed, the spark-warehouse directory won't be created anymore.
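
A minimal sketch of the relevant property (the location shown is Hive's usual default and is only illustrative):

    <!-- $HIVE_HOME/conf/hive-site.xml -->
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
    </property>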

Eugene