I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is written with Kubeflow Pipelines and has many different components, but one thing in particular is giving me trouble. Eventually I want to launch the pipeline from a Cloud Function triggered by Cloud Scheduler.
The part that is giving me issues is fairly simple, and I believe I just need an introduction to how I should be thinking about this setup. I simply want to read from and write to files (.csv, .txt or similar). I imagine that the GCP analog to the filesystem on my local machine is Cloud Storage, so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this:
@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO
    from google.cloud import storage

    BUCKET_NAME = "gs://my_bucket"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)
    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()  # raw bytes of the object
    blob = blob.decode('utf-8')  # transform bytes to string here
    blob = StringIO(blob)  # wrap the string in a file-like object
    names = csv.reader(blob)  # then use the csv library to read the content
    for name in names:
        print(f"First Name: {name[0]}")
The error I'm getting looks like the following:
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found
What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files, so I must be missing something fundamental. Any help is highly appreciated.
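One detail worth noting in the error above: the request URL contains `/b/gs://pipeline_dev`, so the `gs://` prefix was sent as part of the bucket name. The storage client expects the bare bucket name, with the object path given separately. Below is a minimal sketch of that idea, not a definitive fix. `split_gcs_uri` and `read_first_column` are hypothetical helper names made up for illustration, and the sketch assumes the `google-cloud-storage` package is installed in the component and that the pipeline's service account can read the bucket.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/file' into ('bucket', 'path/to/file').

    Hypothetical helper for illustration; not part of google-cloud-storage.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_first_column(uri: str) -> list[str]:
    """Download a CSV-like object and return the first field of each row."""
    import csv
    from io import StringIO

    # Imported lazily so the URI helper can be used without the client library.
    # Assumption: the google-cloud-storage package is installed.
    from google.cloud import storage

    bucket_name, blob_path = split_gcs_uri(uri)
    client = storage.Client()
    bucket = client.bucket(bucket_name)  # bare bucket name, no "gs://" prefix
    text = bucket.blob(blob_path).download_as_text(encoding="utf-8")
    # Skip empty rows so row[0] cannot raise IndexError.
    return [row[0] for row in csv.reader(StringIO(text)) if row]
```

With the names from the question, `split_gcs_uri("gs://my_bucket/test/test.txt")` yields the bucket name `my_bucket` and the object path `test/test.txt`, which are the two values the client calls take separately.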