
I'm trying to test the Spark-HBase connector in the GCP context and tried to follow [1], which asks to locally package the connector [2] with Maven (I tried Maven 3.6.3) for Spark 2.4. I get the following error when submitting the job on Dataproc (after having completed [3]).

Any idea?

Thanks for your support

References

[1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc

[2] https://github.com/hortonworks-spark/shc/tree/branch-2.4

[3] Spark-HBase - GCP template (1/3) - How to locally package the Hortonworks connector?

Command

(base) gcloud dataproc jobs submit spark --cluster $SPARK_CLUSTER --class com.example.bigtable.spark.shc.BigtableSource --jars target/scala-2.11/cloud-bigtable-dataproc-spark-shc-assembly-0.1.jar --region us-east1 -- $BIGTABLE_TABLE

Error

Job [d3b9107ae5e2462fa71689cb0f5909bd] submitted.
Waiting for job output...
20/12/27 12:50:10 INFO org.spark_project.jetty.util.log: Logging initialized @2475ms
20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: Started @2576ms
20/12/27 12:50:10 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/12/27 12:50:10 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at spark-cluster-m/10.142.0.10:8032
20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at spark-cluster-m/10.142.0.10:10200
20/12/27 12:50:13 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1609071162129_0002
Exception in thread "main" java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
    at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:262)
    at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:84)
    at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at com.example.bigtable.spark.shc.BigtableSource$.delayedEndpoint$com$example$bigtable$spark$shc$BigtableSource$1(BigtableSource.scala:56)
    at com.example.bigtable.spark.shc.BigtableSource$delayedInit$body.apply(BigtableSource.scala:19)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.example.bigtable.spark.shc.BigtableSource$.main(BigtableSource.scala:19)
    at com.example.bigtable.spark.shc.BigtableSource.main(BigtableSource.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/12/27 12:50:20 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}

py-r

1 Answer


Consider reading these related SO questions: 1 and 2.

Under the hood, the tutorial you followed, as well as one of the questions indicated, uses the Apache Spark - Apache HBase Connector (SHC) provided by Hortonworks.

The problem seems to be related to an incompatibility with the version of the json4s library: in both cases, it seems that using version 3.2.10 or 3.2.11 in the build process solves the issue.

Add the following dependency to the pom.xml of the shc-core module:

<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_2.11</artifactId>
  <version>3.2.11</version>
</dependency>
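
If you rebuild the example job itself rather than shc-core, a hypothetical equivalent is to pin json4s in that build as well. The sbt setting below is an assumption (the assembly jar in the submit command suggests an sbt-assembly build) and is not part of the original tutorial:

// build.sbt — hypothetical override pinning json4s to the version shc-core was compiled against
dependencyOverrides += "org.json4s" %% "json4s-jackson" % "3.2.11"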
jccampanero
  • Thanks! Looks like it helped (will share the exact pom). I'm now facing another issue: ```...Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration...``` A quick check points to the HADOOP_CLASSPATH definition: anything missing on the GCP side? – py-r Dec 27 '20 at 19:25
  • I am happy to hear that the answer was helpful. I do not think so; it seems to be another problem with the connector: https://github.com/hortonworks-spark/shc/issues/223. I would put my money on including a compatible version of the HBase client (`HBase-common.jar`) in your job uber jar (see the sketch after these comments). – jccampanero Dec 27 '20 at 23:11
  • Thanks for your support ! I wouldn't know how much money to put on this one personally ;) Glad to open a new question by the way. – py-r Dec 28 '20 at 09:12
  • You are welcome @py-r. Well, yes, please test it; you probably need to include some other additional dependencies. By the way, although it is normally unnecessary in Dataproc, where everything is already configured for you, if you ever need to set the `HADOOP_CLASSPATH` there is a common pattern: `HADOOP_CLASSPATH=\`hadoop classpath\` command`. You can do that manually by ssh-ing into your cluster, but if you need repeatable behavior you can set the env vars with an init action and modify `/etc/environment`. Please do not hesitate to contact me if you need further help. – jccampanero Dec 28 '20 at 09:52
  • I forgot to mention, thank you very much for the suggested edit, I appreciate it a lot. – jccampanero Dec 28 '20 at 09:54
  • Thanks. Adding other [dependencies](https://community.cloudera.com/t5/Support-Questions/java-lang-NoClassDefFoundError-using-Hbase-Storage-Handler/td-p/14426) straight to the GCP package helped, but now I'm stuck on the following error: ```20/12/28 11:54:04 INFO com.google.bigtable.repackaged.com.google.cloud.bigtable.grpc.io.OAuthCredentialsCache: Refreshing the OAuth token 20/12/28 11:54:05 WARN com.google.cloud.bigtable.hbase2_x.BigtableAdmin: getNamespaceDescriptor is a no-op Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableOutputFormat ...``` – py-r Dec 28 '20 at 12:05
  • The error seems to be another known issue with shc: https://github.com/hortonworks-spark/shc/issues/303. Where did you get the error? When submitting the job? Please be aware that the link you provided is from Cloudera, and you are using Hortonworks, so some libraries may differ. Maybe this SO [answer](https://stackoverflow.com/questions/54181943/hbase-mapreduce-tableoutputformat-not-found) could be helpful? – jccampanero Dec 28 '20 at 12:58
  • Thanks. I've found documented the way [here](https://stackoverflow.com/questions/65483442/spark-hbase-gcp-template-3-3-missing-libraries/65483468#65483468). Thanks a lot for the hints ! – py-r Dec 28 '20 at 20:20
  • You are welcome @py-r. Please, if you think that I can be of any help, do not hesitate to contact me, I will be glad to help you if I can. – jccampanero Dec 28 '20 at 21:59
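
Following up on the uber-jar suggestion in the comments: a minimal sketch, assuming the example job is built with sbt-assembly (the target/scala-2.11/...-assembly jar in the submit command points that way). The artifacts and version below are assumptions and should be aligned with the HBase/Bigtable client you actually target:

// build.sbt — hypothetical additions so the HBase classes the errors above complain about
// (HBaseConfiguration, mapreduce.TableOutputFormat) end up inside the job uber jar
libraryDependencies ++= Seq(
  "org.apache.hbase" % "hbase-common"    % "2.1.0", // assumed version; match your Bigtable/HBase client
  "org.apache.hbase" % "hbase-mapreduce" % "2.1.0"  // provides org.apache.hadoop.hbase.mapreduce.TableOutputFormat
)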