GCP - CDAP - Dataproc cluster stucks in running state

Question

We have a DataFusion pipeline which is triggered by a Cloud Composer DAG. This pipeline provisions an ephemeral DataProc cluster which cluster - in an ideally scenario - terminates after finishing the tasks.

In our case, sometimes, not always, this ephemeral DataProc cluster stucks in a running state. The job inside in the cluster is also in a running state, and the last log messages are the followings:

INFO runtimejob.DataprocJobMain: Invoking initialize() on io.cdap.cdap.runtime.spi.runtimejob.DataprocRuntimeEnvironment with spark2_2.11
INFO runtimejob.DataprocJobMain: Invoking run() on io.cdap.cdap.internal.app.runtime.distributed.runtimejob.DefaultRuntimeJob
INFO runtimejob.DataprocJobMain: Invoking destroy() on io.cdap.cdap.internal.app.runtime.distributed.runtimejob.DefaultRuntimeJob
INFO runtimejob.DataprocJobMain: Runtime job completed.
Exception: java.lang.NoClassDefFoundError thrown from the UncaughtExceptionHandler in thread " STARTING-SendThread(cdap-<our-identifier>-1f11111b-1d11-11eb-b1a1-1a111fb11d11-m.c.<our-gcp-project-name>.internal:41409)"
Exception: java.lang.NoClassDefFoundError thrown from the UncaughtExceptionHandler in thread "threadDeathWatcher-2-1"

On the DataFusion side, the pipeline marked as successful. DataFusion logs are the followings:

Completed DEPROVISION subtask REQUESTING_DELETE for program run program_run: <data_fusion_namespace>.<pipeline_name>.-SNAPSHOT.workflow.DataPipelineWorkflow.<data_proc_id> //this message is repeated many-many times
DEBUG [provisioning-service-4:i.c.c.c.s.Retries@197] - Retries exhausted after 1 failures and 14 ms.

Any ideas what is causing this issue?

p.s.: identifiers in messages were replaced with random values

score 1 · Accepted Answer · answered Mar 29 '21 at 20:41

1

Which version of Datafusion are you running? Also what is the amount of memory for the Dataproc cluster? Sometimes we observe this issue when the Dataproc cluster ran out of memory. I would suggest increasing the amount of memory.

answered Mar 29 '21 at 20:41

Edwin Elia

399
3
5

we have one master and five workers in the cluster. All of them have 8 CPUs, and 30GB RAM. The number of records in the Datafusion are around 1 billion. The error I described above is undeterministic, sometimes happens sometimes not. Datafusion version is 6.2.0. – Robert Mar 29 '21 at 21:31
There was an issue with CDF version 6.2.0. Can you ugprade the instance to the patch version 6.2.3? – Edwin Elia Mar 29 '21 at 21:39
thanks for the information. Yes, we will definitely do that. With being said I mark your reply as the accepted one. Thank you again. – Robert Mar 30 '21 at 06:44

GCP - CDAP - Dataproc cluster stucks in running state

1 Answers1