3

I am using Dataproc, running one job on one cluster.

I would like to start my job as soon as the cluster is created. I figured the best way to achieve this would be to submit the job from an initialization script like the one below.

# Submit the PySpark job.
function submit_job() {
  echo "Submitting job..."
  gcloud dataproc jobs submit pyspark ...
}
export -f submit_job

# Check whether the cluster has reached the RUNNING state.
function check_running() {
  echo "checking..."
  gcloud dataproc clusters list --region='asia-northeast1' --filter='clusterName = {{ cluster_name }}' |
  tail -n 1 |
  while read name platform worker_count preemptive_worker_count status others
  do
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
  done
}
export -f check_running

# Only on the master node: poll the cluster status, then submit the job.
function after_initialization() {
  local role
  role=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
  if [[ "${role}" == 'Master' ]]; then
    echo "monitoring the cluster..."
    while true; do
      if check_running; then
        submit_job
        break
      fi
      sleep 5
    done
  fi
}
export -f after_initialization

echo "start monitoring..."
# Run in the background so the initialization action itself can finish.
bash -c after_initialization & disown -h

Is this possible? When I ran this on Dataproc, the job was not submitted...

Thank you!

uchiiii

3 Answers

2

Consider using a Dataproc Workflow; it is designed for multi-step workflows such as creating a cluster, submitting a job, and deleting the cluster. It is better than init actions because it is a first-class Dataproc feature: there will be a Dataproc job resource, and you can view the job history.
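
For example, a workflow template with a managed cluster can be set up with gcloud along these lines (the template name, cluster settings, and job file below are placeholders, not values from the question):

# Create an empty workflow template.
gcloud dataproc workflow-templates create my-workflow --region='asia-northeast1'

# Attach a managed cluster: it is created when the workflow starts and deleted when it finishes.
gcloud dataproc workflow-templates set-managed-cluster my-workflow \
  --region='asia-northeast1' \
  --cluster-name='my-cluster' \
  --num-workers=2

# Add the PySpark job as a workflow step.
gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/job.py \
  --step-id='run-job' \
  --workflow-template=my-workflow \
  --region='asia-northeast1'

# Instantiate the workflow: the cluster is created, the job runs, and the cluster is deleted.
gcloud dataproc workflow-templates instantiate my-workflow --region='asia-northeast1'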

Dagang
  • Thank you for your advice! As you suggested, I found it better to use a Dataproc Workflow instead of initialization actions. – uchiiii Sep 14 '21 at 06:32
1

Please consider using Cloud Composer; then you can write a single script that creates the cluster, runs the job, and terminates the cluster.

David Rabinowitz
  • Thank you so much for your reply, David. Actually, I do not want to use Composer, since it is not cost-effective for my case. – uchiiii Sep 05 '21 at 09:36
1

I found a way. Put a shell script named await_cluster_and_run_command.sh on GCS. Then add the following lines to the initialization script.

# Copy the script from GCS and run it in the background with nohup,
# so it keeps running after the initialization action itself exits.
gsutil cp gs://...../await_cluster_and_run_command.sh /usr/local/bin/
chmod 750 /usr/local/bin/await_cluster_and_run_command.sh
nohup /usr/local/bin/await_cluster_and_run_command.sh &>>/var/log/master-post-init.log &

reference: https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/post-init/master-post-init.sh
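
For completeness, here is a rough sketch of what such a script can look like: it polls the cluster state with gcloud and submits the job once the cluster is RUNNING. The job file, region, and polling details below are illustrative placeholders, not the contents of the referenced script.

#!/bin/bash
# Sketch of await_cluster_and_run_command.sh (illustrative, not the referenced file).
# Assumes the dataproc-cluster-name metadata attribute is available on the node.
CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)
REGION='asia-northeast1'

# Wait until the cluster reports RUNNING.
while true; do
  STATE=$(gcloud dataproc clusters describe "${CLUSTER_NAME}" \
    --region="${REGION}" --format='value(status.state)')
  [[ "${STATE}" == 'RUNNING' ]] && break
  sleep 5
done

# Submit the PySpark job (placeholder path).
gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
  --cluster="${CLUSTER_NAME}" --region="${REGION}"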

uchiiii
  • Did you consider Dataproc Workflow? https://cloud.google.com/dataproc/docs/concepts/workflows/overview – Dagang Sep 05 '21 at 21:56
  • Thank you for your comments. I had missed Dataproc Workflow. It seems like the better option. I will try that out! – uchiiii Sep 06 '21 at 06:48