I imported data using Sqoop as a sequence file, and I am loading that data in spark-shell. The code generated by Sqoop references classes in the com.cloudera.sqoop.lib package. Running the following command in spark-shell produces these warnings and an error:

    val ordersRDD = sc.sequenceFile("/user/pawinder/problem1-seq/orders",classOf[org.apache.hadoop.io.IntWritable],classOf[com.problem1.retaildb.orders])
    warning: Class com.cloudera.sqoop.lib.SqoopRecord not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.LargeObjectLoader not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.LargeObjectLoader not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.DelimiterSet not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.DelimiterSet not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.DelimiterSet not found - continuing with a stub.
    warning: Class com.cloudera.sqoop.lib.RecordParser not found - continuing with a stub.
    error: Class com.cloudera.sqoop.lib.SqoopRecord not found - continuing with a stub.

Can I instruct Sqoop to generate the code without a dependency on the Cloudera package? Do I need to add the jar file containing the com.cloudera.sqoop.lib package when starting spark-shell? Where can I find that jar file? Or should I write the value class myself so that it does not depend on com.cloudera.sqoop.lib?

I am using the Cloudera QuickStart VM. Many thanks for your help.

EDIT: The issue is resolved by adding sqoop-1.4.6.2.6.5.0-292.jar to the --jars list when starting the shell:

    spark-shell --jars problem1/bin/orders.jar,/usr/hdp/2.6.5.0-292/sqoop/sqoop-1.4.6.2.6.5.0-292.jar
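
With both jars on the classpath, the Sqoop-generated class resolves and the sequence file loads. A minimal sketch of reading the data afterwards (paths and class names are taken from the question; the conversion step is needed because Hadoop reuses the same Writable instance for every record, so values must be copied to immutable types before collecting or caching):

    // run inside a spark-shell started with the --jars option above
    val ordersRDD = sc.sequenceFile(
      "/user/pawinder/problem1-seq/orders",
      classOf[org.apache.hadoop.io.IntWritable],
      classOf[com.problem1.retaildb.orders])

    // copy each mutable Writable pair into plain immutable values
    val orders = ordersRDD.map { case (id, rec) => (id.get, rec.toString) }
    orders.take(5).foreach(println)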

I also tried to resolve this by defining a case class for Orders, but that did not work: the MapReduce job still referenced the com.cloudera.sqoop package classes.

    scala> case class Orders(order_id:Int,order_date:java.sql.Timestamp,customer_id:Int,status:String)
    defined class Orders

    scala> val ordersRDD = sc.sequenceFile("/user/pawinder/problem1-seq/orders",classOf[org.apache.hadoop.io.IntWritable],classOf[Orders])
    ordersRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.IntWritable, Orders)] = /user/pawinder/problem1-seq/orders HadoopRDD[0] at sequenceFile at <console>:26

    scala> ordersRDD.count
    19/05/14 14:29:21 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
    java.lang.NoClassDefFoundError: com/cloudera/sqoop/lib/SqoopRecord
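
The case class approach cannot work here: a SequenceFile records its key and value class names in its own header, and Hadoop deserializes each value with the class named there (the Sqoop-generated one), regardless of the classOf[...] arguments passed to sc.sequenceFile. That is why the NoClassDefFoundError appears even with classOf[Orders]. A small sketch that prints the class names stored in the header (the part-file name is hypothetical; list the directory for the actual one):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.SequenceFile

    // hypothetical part-file name: check /user/pawinder/problem1-seq/orders for the real one
    val reader = new SequenceFile.Reader(sc.hadoopConfiguration,
      SequenceFile.Reader.file(new Path("/user/pawinder/problem1-seq/orders/part-m-00000")))
    println(reader.getKeyClassName)   // org.apache.hadoop.io.IntWritable
    println(reader.getValueClassName) // the Sqoop-generated orders class
    reader.close()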
  • Can you tell me how to do the same in pyspark? – Karthik Jan 29 '20 at 14:20
  • Check this link on how to add third-party jar files to pyspark: https://stackoverflow.com/questions/27698111/how-to-add-third-party-java-jars-for-use-in-pyspark – pawinder gupta Jan 31 '20 at 14:29
  • Even though I am adding the jar files, when I enter the class name in the value place (i.e. while reading the sequence file), pyspark still throws a "class name not defined" error. – Karthik Jan 31 '20 at 15:01
