When a file is successfully uploaded to a given Google Cloud Storage bucket ("Object Finalize"), I would like to set up a trigger so that the filename is accessible from within a running VM.

There is a standard Cloud Function which listens for when a file has been uploaded, using the trigger google.storage.object.finalize:

def hello_gcs(event, context):
    """Background Cloud Function to be triggered by Cloud Storage.
       This generic function logs relevant data when a file is changed.

    Args:
        event (dict):  The dictionary with data specific to this type of event.
                       The `data` field contains a description of the event in
                       the Cloud Storage `object` format described here:
                       https://cloud.google.com/storage/docs/json_api/v1/objects#resource
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """

    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))

https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python

(I'm using Python but I'm happy to use any of the other languages provided)

Let's say I have a VM named 'my-instance': is there a way to pass the filename from event['name'] to the VM so that the code running on the VM can access it?

There are other SO questions which discuss how to read files directly from Cloud Storage, e.g. Read csv from Google Cloud storage to pandas dataframe

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)

But how can I pass the filename from the google.storage.object.finalize event to the VM so that this code can run?

  • Can you clarify what exactly you mean by "pass the filename from event['name'] to the VM", please? Does such "passing" significantly depend on your "code" running in that VM? How should your "code" become aware of that information? Are you going to modify that "code" as well? – al-dann Mar 02 '21 at 09:36
  • Hi @al-dann, thanks for the help. By "passing", I am referring to the trigger of the file being placed in the bucket. Once that event occurs, how can the VM access that exact file? In the python pandas example with `path.csv`, once `path.csv` is uploaded to the bucket, how can the pandas code access it? With respect to code in the VM, yes, I'm happy to edit it. Does this make sense? – EB2127 Mar 02 '21 at 14:03
  • I still don't understand how your "code" can accept external events. Let's say, for example, that your code scans some directory in the VM every minute, and if a new file appears there, your code does something useful. Or let's say your "code" listens to a PubSub subscription, so when a message appears there, some function in your "code" is invoked. Can your "code" in the VM do anything like that? – al-dann Mar 02 '21 at 14:15
  • The subsequent question - do you really need a VM to run your "code" - can it be a Cloud Function? Cloud Run? Dataflow? App Engine? - instead of a VM? – al-dann Mar 02 '21 at 14:16
  • Let's use the example of `python script.py path.csv` whereby `script.py` parses the argument. Does this make sense? For the other question, there are memory requirements which I need to fulfill. – EB2127 Mar 02 '21 at 14:31
  • Honestly - not really... But probably Guillaume's answer below can help. – al-dann Mar 02 '21 at 14:38
  • I'm happy to edit the question. Let's say `path.csv` is added to a bucket. I can write a trigger to do things when that happens, like spin up a new VM instance. The question is: when that new VM instance is spun up, how can the VM access that file `path.csv`? There's code in the VM which will work with that input file, e.g. `python read.py path.csv`. Is this more clear? – EB2127 Mar 02 '21 at 14:42

1 Answer

You have two options for implementing this on your VM: a push solution or a pull solution.

Push

It's the most obvious approach. Create a webserver on your VM and call it from your Cloud Function (if your VM has only a private IP, use a serverless VPC connector to reach it; that's not a blocker in the end). The webserver receives the call from the Cloud Function and does whatever it needs to fetch the file and trigger your code on the VM.
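
A minimal sketch of the function side, assuming the VM exposes a hypothetical /new-file HTTP endpoint (the internal IP 10.128.0.2 and port 8080 are placeholders):

import requests

# Assumed address of the webserver running on 'my-instance'; with a
# private IP only, this call must go through a serverless VPC connector.
VM_ENDPOINT = 'http://10.128.0.2:8080/new-file'

def hello_gcs(event, context):
    # Push the bucket and object name to the VM; the handler on the VM
    # decides how to fetch and process the file.
    resp = requests.post(VM_ENDPOINT, json={
        'bucket': event['bucket'],
        'name': event['name'],
    }, timeout=10)
    resp.raise_for_status()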

Pull

It's another solution. You can write a Compute Engine metadata entry on your running VM. Then, in the code running on your VM, you check that metadata periodically to see whether a new entry has been registered, and you clear the metadata once the file has been correctly processed.
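
A sketch of the Cloud Function side, using the Compute Engine API via google-api-python-client; the metadata key file_name, the zone, and sufficient IAM permissions on the function's service account are assumptions:

import googleapiclient.discovery

def set_file_metadata(project, zone, instance, file_name):
    compute = googleapiclient.discovery.build('compute', 'v1')
    # setMetadata requires the current metadata fingerprint, so read it first.
    inst = compute.instances().get(
        project=project, zone=zone, instance=instance).execute()
    metadata = inst['metadata']
    items = [i for i in metadata.get('items', []) if i['key'] != 'file_name']
    items.append({'key': 'file_name', 'value': file_name})
    compute.instances().setMetadata(
        project=project, zone=zone, instance=instance,
        body={'fingerprint': metadata['fingerprint'], 'items': items}
    ).execute()

# e.g. set_file_metadata('my-project', 'us-central1-a', 'my-instance', event['name'])

Your code on the VM then polls this key on the metadata server and clears it once the file has been processed.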


In both cases, you need to update your running code so that it is informed of the new event.


EDIT 1

If you use a startup script, you have 2 solutions:

  1. Either your Cloud Function updates the startup script with the new file name and then starts the VM. The file name is directly in the script and can be processed as is.
  2. Or your Cloud Function updates the metadata server with the new file name and then starts the VM. The startup script then has to read that metadata value to get the file name and process it (see the sketch after this list).
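
A sketch of option 2's VM side, reading back the hypothetical file_name metadata key set earlier; starting the VM from the Cloud Function is one more call on the same client shown above, compute.instances().start(project=..., zone=..., instance='my-instance').execute():

import requests

# Instance metadata server; 'file_name' is the assumed key written by
# the Cloud Function before it started the VM.
META_URL = ('http://metadata.google.internal/computeMetadata/v1/'
            'instance/attributes/file_name')

def get_file_name():
    resp = requests.get(META_URL, headers={'Metadata-Flavor': 'Google'})
    resp.raise_for_status()
    return resp.text

if __name__ == '__main__':
    file_name = get_file_name()
    print('Processing {}'.format(file_name))  # e.g. hand off to the pandas/gcsfs code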
guillaume blaquiere
    Thanks Guillaume! This makes sense to me, and I've been expecting to modify the code within the VM. As described in the comments above, this question is actually related to a Cloud Function which is triggered when a new file is added to a bucket. https://stackoverflow.com/questions/66433532/how-can-i-execute-docker-run-via-a-cloud-function-on-a-vm-which-has-been-start/66436761#66436761 Based on this, I wonder if I can avoid a webserver... – EB2127 Mar 02 '21 at 14:45
  • "The file name is directly in the script and can be processed as is." I'll have to try to find an example, but I think this would work. `event['bucket']` and `event['name']` gives me the bucket and filename. I will need to figure out how the startup script could accept this... – EB2127 Mar 02 '21 at 14:59
  • @EB2127 were you able to pass the file name to the startup script? – Jasmine Jul 31 '21 at 12:21