3

I want to read CSV files using the latest Apache Spark version, i.e. 2.2.1, on Windows 7 via cmd, but I am unable to do so because there is some problem with the metastore_db. I tried the steps below:

1. spark-shell --packages com.databricks:spark-csv_2.11:1.5.0  // since my Scala version is 2.11
2. val df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("file:///D:/ResourceData.csv")  // in the latest versions we use the SparkSession variable `spark` instead of `sqlContext`

but it throws the error below:

    Caused by: org.apache.derby.iapi.error.StandardException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader

Caused by: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database 

I am able to read the CSV in version 1.6, but I want to do it in the latest version. Can anyone help me with this? I have been stuck for many days.
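From what I could gather about this Derby error, only one JVM can boot metastore_db at a time, so it can also be triggered by a stale lock file left behind by a crashed or force-closed shell. A minimal cleanup sketch, assuming spark-shell was started from the current cmd directory (db.lck and dbex.lck are Derby's lock files; adjust the path if your metastore_db lives elsewhere):

    :: close all running spark-shell / spark-submit windows first,
    :: then remove the stale Derby lock files from the working directory
    del metastore_db\db.lck
    del metastore_db\dbex.lck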

whatsinthename
  • 1,828
  • 20
  • 59
  • Do you have any other Spark application (incl. `spark-shell`) up and running? Can you edit your question and add the entire stack trace? – Jacek Laskowski Dec 29 '17 at 18:26
  • I don't have any other Spark application running, and it's a long stack trace. I cannot paste it all here, so I just picked out the main cause line from the stack trace – whatsinthename Dec 29 '17 at 18:50
  • BTW, you don't need `com.databricks:spark-csv_2.11:1.5.0` since it's part of Spark 2.x already. Remove it to have less to worry about. – Jacek Laskowski Dec 29 '17 at 21:53
  • I tried without the Databricks package but am still getting the same error. I have updated my question; let me know if any additional information is required. – whatsinthename Dec 30 '17 at 06:41

2 Answers

4

Open Spark Shell

spark-shell

Pass the Spark context to SQLContext and assign it to the sqlContext variable

    val sqlContext = new org.apache.spark.sql.SQLContext(sc) // the Spark context is available in the shell as 'sc'

Read the CSV file as per your requirement

val bhaskar = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/home/burdwan/Desktop/bhaskar.csv") // Use wildcard, with * we will be able to import multiple csv files in a single load ...Desktop/*.csv

Collect the rows to the driver and print them

bhaskar.collect.foreach(println)

Output

_a1 _a2     Cn      clr clarity depth   aprx price  x       y       z
1   0.23    Ideal   E   SI2     61.5    55   326    3.95    3.98    2.43
2   0.21    Premium E   SI1     59.8    61   326    3.89    3.84    2.31
3   0.23    Good    E   VS1     56.9    65   327    4.05    4.07    2.31
4   0.29    Premium I   VS2     62.4    58   334    4.2     4.23    2.63
5   0.31    Good    J   SI2     63.3    58   335    4.34    4.35    2.75
6   0.24    Good    J   VVS2    63      57   336    3.94    3.96    2.48
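
For completeness: in Spark 2.x the CSV source is built in and the shell pre-creates a SparkSession, so the same read also works without constructing a SQLContext. A minimal sketch, assuming the same file path and the default `spark` variable that spark-shell announces on startup:

    // Spark 2.x spark-shell pre-creates the SparkSession as `spark`;
    // the csv format is bundled, so no --packages flag is needed.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/home/burdwan/Desktop/bhaskar.csv")

    df.show() // renders the first rows as a table instead of collecting everything to the driver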
Bhaskar Das
  • 652
  • 1
  • 9
  • 28
0

In the end, this also worked only on a Linux-based OS. Download Apache Spark from the official documentation and set it up using this link. Just verify that you can invoke spark-shell, and then enjoy loading and acting on any type of file with the latest Spark version. I don't know why it doesn't work on Windows, even though I am running it for the first time.
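A minimal sketch of that Linux setup, in case it helps someone else; the version and paths are just what I used (the prebuilt spark-2.2.1-bin-hadoop2.7 package, unpacked into the home directory):

    # unpack a prebuilt Spark distribution and put its bin/ on the PATH
    tar -xzf spark-2.2.1-bin-hadoop2.7.tgz -C ~
    export SPARK_HOME=~/spark-2.2.1-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH

    # verify the shell starts; it should print the Spark 2.2.1 banner
    # and announce the `spark` SparkSession
    spark-shell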

whatsinthename
  • 1,828
  • 20
  • 59