
We've got multiple Google Cloud Dataflow jobs (written in Java / Kotlin), and they can be run in two different ways:

  1. Initiated from a user's Google Cloud account
  2. Initiated from a service account (with the required policies and permissions)

When running the Dataflow job from a user's account, Dataflow provides the default controller service account to the workers; it does not propagate the authorized user's credentials to the workers.

When running the Dataflow job from the service account, I would expect the service account that is set using setGcpCredential to be propagated to the worker VMs that Dataflow uses in the background. The JavaDocs don't mention any of this, but they do state that the credentials are used to authenticate against GCP services.

In most of our use cases, we run the Dataflow job in project A while reading from BigQuery in project B. We therefore grant reader access to the BigQuery dataset in project B both to the user and to the service account used in the second approach described above. That same service account also has the BigQuery roles jobUser and dataViewer in project A.
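
For context, the read is essentially a plain BigQueryIO read against the dataset in project B, roughly along these lines (table, dataset and project names are placeholders, and "options" is the DataflowPipelineOptions described below):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

// The job runs in project A (set in the pipeline options), but the table
// it reads lives in project B.
Pipeline pipeline = Pipeline.create(options);
PCollection<TableRow> rows =
    pipeline.apply("ReadFromProjectB",
        BigQueryIO.readTableRows().from("project-b:dataset.table"));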

Now, the issue is that in both cases we seem to need to grant the default controller service account access to the BigQuery dataset used in the Dataflow job. If we don't, we get a permission denied (403) from BigQuery when the job tries to access the dataset in project B. For the second approach described above, I'd expect Dataflow to be independent of the default controller service account. My hunch is that Dataflow does not propagate the service account that is set in the PipelineOptions to the workers.

In general, we provide project, region, zone, temporary locations (gcpTempLocation, tempLocation, stagingLocation), the runner type (in this case DataflowRunner), and the gcpCredential as PipelineOptions.
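
A minimal sketch of how we set these options (project, region, bucket and key-file names are placeholders):

import java.io.FileInputStream;
import java.util.Collections;

import com.google.auth.oauth2.GoogleCredentials;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("project-a");                        // project the job runs in
options.setRegion("europe-west1");
options.setZone("europe-west1-b");
options.setGcpTempLocation("gs://some-bucket/gcp-temp");
options.setTempLocation("gs://some-bucket/temp");
options.setStagingLocation("gs://some-bucket/staging");
// Credentials of the service account used to submit the job (the second approach above).
options.setGcpCredential(
    GoogleCredentials.fromStream(new FileInputStream("dataflow-sa-key.json"))
        .createScoped(Collections.singletonList(
            "https://www.googleapis.com/auth/cloud-platform")));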

So, does Google Cloud Dataflow really propagate the provided serviceaccount to the workers?

Update

We first tried adding options.setServiceAccount, as indicated by Magda, without adding IAM permissions. This resulted in the following error in the Dataflow logs:

{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : " Current user cannot act as service account dataflow@project.iam.gserviceaccount.com. Causes: Current user cannot act as service account dataflow@project.iam.gserviceaccount.com..",
    "reason" : "forbidden"
  } ],
  "message" : " Current user cannot act as service account dataflow@project.iam.gserviceaccount.com.. Causes: Current user cannot act as service account dataflow@project.iam.gserviceaccount.com.",
  "status" : "PERMISSION_DENIED"
}

After that, we tried adding roles/iam.serviceAccountUser to this service account. Unfortunately, that resulted in the same error. This service account already had the IAM roles Dataflow Worker and BigQuery Job User. The default Compute Engine controller service account 123456-compute@developer.gserviceaccount.com only has the Editor role, and we did not add any other IAM roles / permissions.

Robin Trietsch

1 Answer


I think you need to set the controller service account too. You can use options.setServiceAccount("hereYourControllerServiceAccount@yourProject.iam.gserviceaccount.com") in the Dataflow pipeline options.
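
A minimal sketch of where this fits (the e-mail address is the placeholder from above):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
// Workers will run as this controller service account instead of the
// default Compute Engine service account.
options.setServiceAccount(
    "hereYourControllerServiceAccount@yourProject.iam.gserviceaccount.com");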

You will need to add some additional permissions:

  • For controller: Dataflow Worker and Storage Object Admin.

  • For executor: Service Account User.

That's what I found in Google's documentation and tried out myself.

I think that might give you some insights:

For the BigQuery source and sink to operate properly, the following two accounts must have access to any BigQuery datasets that your Cloud Dataflow job reads from or writes to:

  • The GCP account you use to execute the Cloud Dataflow job

  • The controller service account running the Cloud Dataflow job

For example, if your GCP account is abcde@gmail.com and the project number of the project where you execute the Cloud Dataflow job is 123456789, the following accounts must all be granted access to the BigQuery Datasets used: abcde@gmail.com, and 123456789-compute@developer.gserviceaccount.com.

More on: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#controller_service_account

Magda Kiwi
  • Hi, thanks for your response! The last part of your answer might be the issue we're facing. As for the first part, about setting the service account in the pipeline options, that's something we've tried, though it failed with a message like "... cannot act on behalf of serviceaccount ..." – Robin Trietsch Dec 12 '18 at 10:31
  • Did you set up the proper roles for the service accounts in IAM? Or can you show more logs? – Magda Kiwi Dec 12 '18 at 10:37
  • I think I should add that the default controller service account (i.e. when you do not specify one) is the default Compute Engine service account. – Magda Kiwi Dec 12 '18 at 12:17
  • I've updated the OP with some more information about the error we encounter. – Robin Trietsch Dec 12 '18 at 13:05
  • As I see from your update, dataflow@project.iam.gserviceaccount.com is going to be the new controller; this one should have the Dataflow Worker and Storage Object Admin roles. And the one you are using in setGcpCredential, so your current user, should have Service Account User. – Magda Kiwi Dec 12 '18 at 13:22
  • That's what worked for me, I was getting the same error before. – Magda Kiwi Dec 12 '18 at 13:29
  • controller role: https://cloud.google.com/dataflow/docs/concepts/access-control#example_role_assignment – Magda Kiwi Dec 12 '18 at 13:36
  • my error: https://stackoverflow.com/questions/53739459/dataflow-setting-controller-service-account – Magda Kiwi Dec 12 '18 at 13:41