
According to the documentation, there is an option to use an existing Dataproc cluster in version 6.2 and above.

We use Cloud Data Fusion 6.2.0, but the Existing Dataproc option does not appear when we try to create a new compute profile. [screenshot: no Existing Dataproc option in the profile list]

What are we doing wrong? Why does the described option not show up? Do we have to do some additional configuration?

UPDATE 1

When I choose Dataproc, I see the following: [screenshots of the Dataproc compute profile settings]

UPDATE 2

When we try to use the Remote Hadoop Provisioner, we get the following error message in the /logs/program.log file. The SSH connection is successful, because the run-id folder is there.


2021-06-15 09:40:37,617 - ERROR [main:o.a.z.s.NIOServerCnxnFactory@44] - Thread Thread[main,5,main] died
java.lang.reflect.InvocationTargetException: null
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_282]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_282]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_282]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_282]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteLauncher.main(RemoteLauncher.java:73) ~[launcher.jar:na]
Caused by: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) ~[hadoop-common-3.2.2.jar:na]
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338) ~[hadoop-common-3.2.2.jar:na]
        at io.cdap.cdap.common.conf.CConfigurationUtil.copyTxProperties(CConfigurationUtil.java:100) ~[na:na]
        at io.cdap.cdap.common.guice.ConfigModule.<init>(ConfigModule.java:62) ~[na:na]
        at io.cdap.cdap.common.guice.ConfigModule.<init>(ConfigModule.java:49) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionJobMain.initialize(RemoteExecutionJobMain.java:117) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionJobMain.doMain(RemoteExecutionJobMain.java:98) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionJobMain.main(RemoteExecutionJobMain.java:73) ~[na:na]
        ... 5 common frames omitted
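
For reference, a NoSuchMethodError on Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V usually means an older Guava jar (the non-varargs checkArgument overloads appeared around Guava 20.0) is shadowing the newer one that hadoop-common 3.2.2 expects on the remote cluster's classpath. A minimal diagnostic sketch, purely illustrative (the class and method names come from the stack trace above; the GuavaCheck class name is made up):

```java
// Diagnostic sketch (assumption: run it with the same classpath as the
// remote job, e.g. the output of `hadoop classpath`). It reports which jar
// provides com.google.common.base.Preconditions and whether the overload
// called by Configuration.set() in hadoop-common 3.2.2 exists.
public class GuavaCheck {
    public static void main(String[] args) {
        try {
            Class<?> pre = Class.forName("com.google.common.base.Preconditions");
            Object location = pre.getProtectionDomain().getCodeSource() == null
                    ? "bootstrap classpath"
                    : pre.getProtectionDomain().getCodeSource().getLocation();
            System.out.println("Preconditions loaded from: " + location);
            // The overload from the stack trace: checkArgument(boolean, String, Object)
            pre.getMethod("checkArgument", boolean.class, String.class, Object.class);
            System.out.println("checkArgument(boolean, String, Object): present");
        } catch (ClassNotFoundException e) {
            System.out.println("Guava is not on the classpath");
        } catch (NoSuchMethodException e) {
            System.out.println("checkArgument(boolean, String, Object): MISSING"
                    + " (an old Guava is shadowing the newer one)");
        }
    }
}
```

If this reports the overload as missing, the usual fix is aligning the Guava version on the cluster with what Hadoop 3.2.x needs, or putting the newer jar earlier on the classpath.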
Robert

2 Answers


For 6.2.0, "Remote Hadoop Provisioner" is the right option to use with an existing Dataproc cluster. The stuck-pipeline issue you hit is caused by a rare case where API activation fails to assign the necessary role to the Dataproc-specific service account. It can be solved simply by granting the following service account the "Dataproc Service Agent" role in your project:

service-${project number}@dataproc-accounts.iam.gserviceaccount.com
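
A sketch of granting that role with gcloud (assumption: PROJECT_ID is a placeholder for your actual project ID; the project number is looked up from it):

```shell
# Assumption: replace with your actual project ID.
PROJECT_ID="my-project"

# Look up the project number, since the service agent account is named
# service-<project number>@dataproc-accounts.iam.gserviceaccount.com.
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" \
  --format='value(projectNumber)')

# Grant the Dataproc Service Agent role to that service account.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:service-${PROJECT_NUMBER}@dataproc-accounts.iam.gserviceaccount.com" \
  --role="roles/dataproc.serviceAgent"
```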

Sean Zhou
  • The role is granted, but the issue is still the same. I have edited the post; please see the exact error message we got. – Robert Jun 15 '21 at 09:44

I wasn't able to reproduce the exact scenario: when creating a CDF instance from scratch, the closest version I could select was Cloud Data Fusion 6.2.3.

I can confirm that on version 6.2.3 you have the option to choose an existing Dataproc cluster, so I would recommend upgrading to at least that version. Follow these docs in order to do it in a safe way.

As an alternative, there is a method to configure a Cloud Data Fusion pipeline to run against an existing cluster here. This feature is available only in the Enterprise edition of Cloud Data Fusion.

davidmesalpz
  • You can create CDF on version 6.2.0 using the REST API. I have deployed a CDF instance with version 6.2.1, and the Existing Dataproc Cluster option shows up. We use the Enterprise edition, but we aren't in a position to upgrade the existing CDF instance from 6.2.0 to any other version. When we try to use the Remote Hadoop Provisioner option on CDF 6.2.0, pipelines get stuck in the Starting state. CDF can SSH into the cluster, but in the folder /logs/program.log we see: ERROR [main:o.a.z.s.NIOServerCnxnFactory@44] - Thread Thread[main,5,main] died java.lang.reflect.InvocationTargetException: null – Robert Jun 04 '21 at 09:43