
I'm trying to fetch a file from Google Drive using Apache Beam. I tried:

filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    lines = (pipeline | beam.Create(filenames))
print(lines)

This prints a string like PCollection[[19]: Create/Map(decode).None]

I need to read a file from Google Drive and write it into a GCS bucket. How can I read a file from Google Drive with Apache Beam?

codebot
  • Airflow has an operator to support this use case: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/transfer/gdrive_to_gcs.html – Bruno Volpato Nov 13 '22 at 17:07

2 Answers


If you don’t have complex transformations to apply, I think it’s better not to use Beam in this case.

  • Solution 1 :

You can instead use Google Colab (a Jupyter Notebook on Google servers), mount your Google Drive, and use the gcloud CLI to copy files.

You can check the following links:

google-drive-to-gcs

stackoverflow-copy-file-from-google-drive-to-gcs
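In a Colab notebook, the mount-and-copy steps above can be sketched roughly as follows. The source path and bucket URI are placeholders, `drive.mount` is the standard Colab helper, and the copy itself shells out to `gsutil` (assumes you have already authenticated with `gcloud`):

```python
import subprocess

def build_copy_command(src_path, dest_uri):
    """Build the gsutil command that copies a mounted Drive file to GCS."""
    return ["gsutil", "cp", src_path, dest_uri]

def copy_drive_file_to_gcs(src_path, dest_uri):
    # Mount Google Drive under /content/drive (this helper only exists in Colab).
    from google.colab import drive
    drive.mount("/content/drive")
    # Copy the mounted file to the bucket.
    subprocess.run(build_copy_command(src_path, dest_uri), check=True)
```

For example: `copy_drive_file_to_gcs("/content/drive/MyDrive/data.csv", "gs://my-bucket/data.csv")`.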

  • Solution 2

You can also use APIs to retrieve files from Google Drive and copy them to Cloud Storage.

You can, for example, develop a Python script using the Google Python clients and the following packages:

google-api-python-client 
google-auth-httplib2 
google-auth-oauthlib 
google-cloud-storage

This article shows an example.
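A rough sketch of that approach, assuming the Drive file is readable by the authenticated account and Application Default Credentials are set up. The bucket and blob names are placeholders, and `extract_file_id` is a small helper added here for illustration:

```python
import io
import re

def extract_file_id(drive_url):
    """Pull <file_id> out of a https://drive.google.com/file/d/<file_id>/... URL."""
    match = re.search(r"/file/d/([^/?#]+)", drive_url)
    if not match:
        raise ValueError(f"Not a Drive file URL: {drive_url}")
    return match.group(1)

def copy_drive_file_to_gcs(drive_url, bucket_name, blob_name):
    # Imports kept local so the helper above works without these packages installed.
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaIoBaseDownload
    from google.cloud import storage

    # Download the file content from Drive into memory.
    drive_service = build("drive", "v3")
    request = drive_service.files().get_media(fileId=extract_file_id(drive_url))
    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()

    # Upload the downloaded bytes to the GCS bucket.
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(
        buffer.getvalue())
```

Note that in-memory buffering is fine for small files; for large files you would stream to a temporary file instead.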

Mazlum Tosun

If you want to use Beam for this, you could write a function

def read_from_gdrive_and_yield_records(path):
    ...

and then use it like

filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    paths = pipeline | beam.Create(filenames)
    records = paths | beam.FlatMap(read_from_gdrive_and_yield_records)
    records | beam.io.WriteToText('gs://...')

Though as mentioned, unless you have a lot of files, this may be overkill.

robertwb