3

I am using Dataproc, running one job on one cluster.

I would like to start my job as soon as the cluster is created. I figured the best way to achieve this would be to submit the job from an initialization script like the one below.

# Submit the PySpark job.
function submit_job() {
  echo "Submitting job..."
  gcloud dataproc jobs submit pyspark ...
}
export -f submit_job

# Check whether the cluster has reached the RUNNING state.
function check_running() {
  echo "checking..."
  gcloud dataproc clusters list --region='asia-northeast1' --filter='clusterName = {{ cluster_name }}' |
  tail -n 1 |
  while read name platform worker_count preemptive_worker_count status others
  do
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
  done
}
export -f check_running

# Only on the master node: poll the cluster status, then submit the job.
function after_initialization() {
  local role
  role=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
  if [[ "${role}" == 'Master' ]]; then
    echo "monitoring the cluster..."
    while true; do
      if check_running; then
        submit_job
        break
      fi
      sleep 5
    done
  fi
}
export -f after_initialization

echo "start monitoring..."
# Run in the background so the initialization action itself can finish.
bash -c after_initialization & disown -h

Is this possible? When I ran this on Dataproc, the job was not submitted...

Thank you!

uchiiii

3 Answers

2

Consider using a Dataproc Workflow; it is designed for multi-step workflows such as creating a cluster, submitting a job, and deleting the cluster. It is better than init actions because it is a first-class Dataproc feature: there will be a Dataproc job resource, and you can view the job history.
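
For example, a workflow template with a managed cluster can be set up with gcloud along these lines (the template name, cluster settings, and job file below are placeholders, not values from the question):

# Create an empty workflow template.
gcloud dataproc workflow-templates create my-workflow --region='asia-northeast1'

# Attach a managed cluster: it is created when the workflow starts and deleted when it finishes.
gcloud dataproc workflow-templates set-managed-cluster my-workflow \
  --region='asia-northeast1' \
  --cluster-name='my-cluster' \
  --num-workers=2

# Add the PySpark job as a workflow step.
gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/job.py \
  --step-id='run-job' \
  --workflow-template=my-workflow \
  --region='asia-northeast1'

# Instantiate the workflow: the cluster is created, the job runs, and the cluster is deleted.
gcloud dataproc workflow-templates instantiate my-workflow --region='asia-northeast1'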

Dagang
  • Thank you for your advice! As you suggested, I found it better to use a Dataproc Workflow instead of initialization actions. – uchiiii Sep 14 '21 at 06:32
1

Please consider using Cloud Composer; then you can write a single script that creates the cluster, runs the job, and terminates the cluster.

David Rabinowitz
  • Thank you so much for your reply, David. Actually, I do not want to use Composer, since it is not cost-effective for my case. – uchiiii Sep 05 '21 at 09:36
1

I found a way. Put a shell script named await_cluster_and_run_command.sh on GCS. Then add the following lines to the initialization script.

# Copy the script from GCS and run it in the background with nohup,
# so it keeps running after the initialization action itself exits.
gsutil cp gs://...../await_cluster_and_run_command.sh /usr/local/bin/
chmod 750 /usr/local/bin/await_cluster_and_run_command.sh
nohup /usr/local/bin/await_cluster_and_run_command.sh &>>/var/log/master-post-init.log &

reference: https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/post-init/master-post-init.sh
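
For completeness, here is a rough sketch of what such a script can look like: it polls the cluster state with gcloud and submits the job once the cluster is RUNNING. The job file, region, and polling details below are illustrative placeholders, not the contents of the referenced script.

#!/bin/bash
# Sketch of await_cluster_and_run_command.sh (illustrative, not the referenced file).
# Assumes the dataproc-cluster-name metadata attribute is available on the node.
CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)
REGION='asia-northeast1'

# Wait until the cluster reports RUNNING.
while true; do
  STATE=$(gcloud dataproc clusters describe "${CLUSTER_NAME}" \
    --region="${REGION}" --format='value(status.state)')
  [[ "${STATE}" == 'RUNNING' ]] && break
  sleep 5
done

# Submit the PySpark job (placeholder path).
gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
  --cluster="${CLUSTER_NAME}" --region="${REGION}"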

uchiiii
  • Did you consider Dataproc Workflow? https://cloud.google.com/dataproc/docs/concepts/workflows/overview – Dagang Sep 05 '21 at 21:56
  • Thank you for your comments. I had missed Dataproc Workflow. It seems like the better option. I will try that out! – uchiiii Sep 06 '21 at 06:48