
I am trying to create a Dataproc cluster via DataprocClusterCreateOperator in Apache Airflow (Airflow version: 1.10.15, Composer version: 1.16.4). I want to assign a project-owned temp bucket to the cluster instead of the bucket Google creates at runtime. This option is available when creating a cluster via the command line using the --temp-bucket flag, but the same parameter is not available to pass via DataprocClusterCreateOperator.

Dataproc operator info: https://airflow.apache.org/docs/apache-airflow/1.10.15/_modules/airflow/contrib/operators/dataproc_operator.html

Creating the cluster via the gcloud command line:

gcloud dataproc clusters create cluster-name \
    --properties=core:fs.defaultFS=gs://defaultFS-bucket-name \
    --region=region \
    --bucket=staging-bucket-name \
    --temp-bucket=project-owned-temp-bucket-name \
    other args ...
And the corresponding operator call in my DAG:

create_cluster = DataprocClusterCreateOperator(
    task_id="create_cluster",
    project_id="my-project_id",
    cluster_name="my-dataproc-{{ ds_nodash }}",
    num_workers=2,
    storage_bucket="project_bucket",
    region="us-east4",
    # ... other params ...
)
1 Answer


Unfortunately, the DataprocClusterCreateOperator in Airflow doesn't support the temp-bucket property. You can set this property only with the gcloud command or the REST API.

With the REST API, you can set the ClusterConfig.configBucket and ClusterConfig.tempBucket fields in a clusters.create request.
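
For example, a minimal sketch of such a request (the project, region, cluster, and bucket names below are placeholders, not values from your setup):

# All names below (project, region, cluster, buckets) are placeholders -- substitute your own.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataproc.googleapis.com/v1/projects/my-project-id/regions/us-east4/clusters" \
  -d '{
    "clusterName": "my-dataproc-cluster",
    "config": {
      "configBucket": "staging-bucket-name",
      "tempBucket": "project-owned-temp-bucket-name",
      "workerConfig": { "numInstances": 2 }
    }
  }'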

A possible workaround could be creating a Cloud Scheduler job that issues this REST call on a schedule. You can see the Cloud Scheduler documentation.
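
For instance, a minimal sketch, assuming the JSON body above is saved as cluster.json and a service account with Dataproc permissions already exists (the job name, schedule, and service account email here are hypothetical):

# Job name, schedule, and service account are hypothetical -- adjust to your setup.
gcloud scheduler jobs create http create-dataproc-cluster \
    --schedule="0 6 * * *" \
    --uri="https://dataproc.googleapis.com/v1/projects/my-project-id/regions/us-east4/clusters" \
    --http-method=POST \
    --message-body-from-file=cluster.json \
    --oauth-service-account-email=scheduler-sa@my-project-id.iam.gserviceaccount.com

Cloud Scheduler then issues the clusters.create call on the given cron schedule, authenticating as that service account.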

@naval if this or any answer has solved your question please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – Raul Saucedo Mar 14 '22 at 15:32