
I am trying to load a plain text file into a Hive table using Spark. I am using Spark version 2.0.2. I did this successfully in Spark version 1.6.0 and am trying to do the same in version 2.x. I executed the steps below:

    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().appName("SparkHiveLoad").master("local").enableHiveSupport().getOrCreate()
    import spark.implicits._

There is no problem until now. But when I try to load the file into Spark:

    val partfile = spark.read.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/partfile")

I am getting an exception:

    Caused by: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database /home/cloudera/metastore_db.

The default property in core-site.xml:

 <property>
    <name>fs.defaultFS</name>
    <value>hdfs://quickstart.cloudera:8020</value>
  </property>

There were no other Hive or Spark sessions running in the background. I have seen different questions with the same exception, so please read this through once, and if you still think it is a duplicate, you can mark it.

Could anyone tell me how I can fix it?

Metadata
  • Please provide the full error – Sandeep Singh Jul 03 '17 at 14:47
  • Caused by: org.apache.derby.iapi.error.StandardException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6ba6ec73, see the next exception for details. at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 144 more Caused by: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database /home/cloudera/metastore_db. – Metadata Jul 03 '17 at 14:48
  • Possible duplicate of [Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database](https://stackoverflow.com/questions/34465516/caused-by-error-xsdb6-another-instance-of-derby-may-have-already-booted-the-da) – T. Gawęda Jul 03 '17 at 19:15
  • @T.Gawęda The points at which the exception occurs in the two questions are different. But if you could tell me what the similarity between the two questions is, other than the same heading, I can try working on that question's solution to solve mine. – Metadata Jul 04 '17 at 14:56

1 Answer


In Spark 2.0.2, `spark.sparkContext.textFile` is generally used to read a text file.

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Seqs or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.

Sample code:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Case class defining the schema; its field names become the column names
case class Person(name: String, age: Long)

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
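
Once the view is registered, it can be queried with SQL. The per-line parsing above can also be factored into a small helper — a minimal sketch, assuming the `Person` case class from the Spark documentation example (`name: String, age: Long`):

```scala
// Case class as in the Spark docs example (assumed fields: name, age)
case class Person(name: String, age: Long)

// Parse one line like "Michael, 29" into a Person
def parsePerson(line: String): Person = {
  val attributes = line.split(",")
  Person(attributes(0), attributes(1).trim.toLong)
}

// After peopleDF.createOrReplaceTempView("people"), the view is
// queryable with SQL, e.g.:
// val teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
```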

Please refer to the Spark documentation for more information and to check other options as well.

Sandeep Singh
  • Okay, I am able to get the data using val partfile = spark.sparkContext.textFile("hdfs://quickstart:8020/user/cloudera/partfile"). But when I typed 'val partfile = spark.read.' and pressed 'tab', I got the options: csv, format, jdbc, json, load, option, options, orc, parquet, schema, table, text and textFile. So it means there is an option, right? Also, using spark.sparkContext you just have a 'textFile' option for reading files, but you don't have the options to read files as "json, parquet, ORC, etc". – Metadata Jul 03 '17 at 15:10
  • I have not tried with spark.read.textFile yet. I will check with it and update accordingly. – Sandeep Singh Jul 03 '17 at 15:11
  • Spark has in-built libraries to read JSON, Parquet and ORC files, i.e. for JSON your code will be `val df = spark.read.json("examples/src/main/resources/people.json")` – Sandeep Singh Jul 03 '17 at 15:12
  • Exactly. You can see those options along with 'textFile', as listed in my first comment. But in our project, we need to process a '.txt' file. – Metadata Jul 03 '17 at 15:17
  • `spark.read` -> these formats are used to read structured data, where a schema is supplied with it. Since a text file is not associated with a schema, the functions to read it are different. – Sandeep Singh Jul 03 '17 at 15:17
  • I tried this one: val partfile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile"), but it resulted in: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------ Could you tell me how to fix it? – Metadata Jul 03 '17 at 15:54
  • `hdfs dfs -chmod 755 /tmp/hive` – Sandeep Singh Jul 03 '17 at 17:11
  • Permission should be given to the user you are running the application – Sandeep Singh Jul 03 '17 at 17:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/148323/discussion-between-sidhartha-and-sandeep-singh). – Metadata Jul 04 '17 at 13:28