
Is it possible to set a fully customized metric for auto scale-out of Dataproc worker nodes in GCP (Google Cloud Platform)?

I want to run distributed Spark processing with Dataproc in GCP. The thing is, I want to horizontally scale the worker nodes out based on fully customized metric data. The reason I am curious about this is that a prediction of the data volume expected to be processed in the future is available:

now / now+1 / now+2 / now+3
1GB / 2GB / 1GB / 3GB <=== expected data volume (metric)

So, could I predictively scale out/in according to the expected future data volume? Thanks in advance.

jinsu park
  • Have you read this [chapter](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#how_autoscaling_works) of the Dataproc documentation? The cluster autoscaling decision is made using YARN metrics, based on the available/pending memory consumed by the running containers. – Nick_Kh Dec 10 '20 at 13:21
  • Yes, I read that doc. What I want to do is scale Dataproc by custom metrics determined by the system's own characteristics, not YARN metrics. Do you know a way to do that? The reason I'm trying to do this is that we can know how much resource will be required in the near future after analyzing previous history data. – jinsu park Dec 16 '20 at 04:23
  • Currently Dataproc cluster autoscaling is based on YARN metrics, so to make this more visible you can file a feature [request](https://cloud.google.com/support/docs/issue-trackers) with the developers for possible future implementation. – Nick_Kh Dec 17 '20 at 08:15
  • Have you considered any other design approach to overcome this? – Nick_Kh Dec 21 '20 at 08:24

1 Answer


No, currently Dataproc autoscales clusters only based on YARN memory metrics.
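
For context (not part of the original answer), here is a minimal sketch of what the supported, YARN-memory-based autoscaling looks like when a policy is created through the `google-cloud-dataproc` Python client; the project, region, policy ID, and all numeric values are placeholders you would tune for your own cluster.

```python
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

# A regional endpoint is required for Dataproc autoscaling policies.
region = "us-central1"  # placeholder
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The only supported algorithm reacts to YARN pending/available memory;
# there is no hook here for a user-defined (custom) metric.
policy = dataproc_v1.AutoscalingPolicy(
    id="yarn-memory-policy",  # placeholder
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
            scale_up_factor=0.5,
            scale_down_factor=0.5,
        )
    ),
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2,
        max_instances=10,
    ),
)

client.create_autoscaling_policy(
    parent=f"projects/my-project/regions/{region}",  # placeholder project
    policy=policy,
)
```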

You need to write your Spark job in a way that it requests more Spark executors (and, as a result, more YARN memory) when it processes more data. Usually this means splitting and partitioning your data into more partitions as the data size increases.
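
To illustrate that idea, here is a minimal PySpark sketch (assuming dynamic allocation is enabled; the input path, the size estimate, and all values below are placeholders): the partition count is derived from the expected input size, so a larger input produces more tasks, Spark requests more executors, and the resulting pending YARN memory is what drives the cluster to scale out.

```python
from pyspark.sql import SparkSession

# Dynamic allocation lets Spark ask YARN for more executors as tasks queue up;
# Dataproc autoscaling then reacts to the pending YARN memory.
spark = (
    SparkSession.builder
    .appName("size-aware-partitioning")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)

# Hypothetical input path and a rough size estimate (e.g. the predicted 3 GB batch).
input_path = "gs://my-bucket/events/"       # placeholder
estimated_input_bytes = 3 * 1024 ** 3       # placeholder prediction
target_partition_bytes = 128 * 1024 ** 2    # ~128 MB per partition

# More data -> more partitions -> more tasks -> more executors requested.
num_partitions = max(1, estimated_input_bytes // target_partition_bytes)

df = spark.read.parquet(input_path).repartition(int(num_partitions))
df.groupBy("some_key").count().write.mode("overwrite").parquet("gs://my-bucket/out/")
```

The exact target partition size is a tuning choice; the point is only that the partition count, and therefore the executor demand, tracks the data volume.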

Igor Dvorzhak