
I want to build an ETL pipeline that:

  1. Reads files from an on-premises filesystem
  2. Writes the files into a Cloud Storage bucket

Is it possible to import the files (regularly, every day) directly with the Storage Transfer Service? Let's suppose I want to build the pipeline with Dataflow (with Python as the programming language). Is it possible to implement such a workflow? If so, are there any Python examples with Apache Beam?

Thank you in advance

alex-mont

2 Answers


Since you stated that importing is a daily task, you may opt to use Cloud Composer instead of Dataflow, as discussed in this SO post. You can check the product details here. Cloud Composer uses Apache Airflow, and you can combine the SFTPOperator and the LocalFilesystemToGCSOperator to achieve your requirement.
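
For illustration, a minimal Airflow DAG along those lines might look like the sketch below. The connection ID, file paths, schedule, and bucket name are placeholders (not values from the question) and would need to be adapted to your environment:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.sftp.operators.sftp import SFTPOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import (
    LocalFilesystemToGCSOperator,
)

# Placeholder values: adapt the connection ID, paths, and bucket to your setup.
with DAG(
    dag_id="onprem_to_gcs_daily",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull the file from the on-premises server over SFTP
    # onto the Composer worker's local filesystem.
    fetch_file = SFTPOperator(
        task_id="fetch_from_onprem",
        ssh_conn_id="onprem_sftp",          # Airflow connection to the on-prem host
        remote_filepath="/data/export/daily.csv",
        local_filepath="/tmp/daily.csv",
        operation="get",
    )

    # Upload the local copy to the Cloud Storage bucket.
    upload_to_gcs = LocalFilesystemToGCSOperator(
        task_id="upload_to_gcs",
        src="/tmp/daily.csv",
        dst="incoming/daily.csv",
        bucket="my-etl-bucket",
    )

    fetch_file >> upload_to_gcs
```

The SSH credentials for the on-premises host would need to be configured as an Airflow connection (here called onprem_sftp) before the DAG can run.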

If you opt to use Cloud Composer, you can post another question on SO for this specific product with the correct tagging so that others in the community can easily find the answer to your question, and I will gladly share working code with the correct output with you.

Anjela B
  • Thanks for the reply. This is the SO post with the new, detailed question: https://stackoverflow.com/questions/74054586/how-to-transfer-files-and-directories-from-remote-server-to-gcs-buckets – alex-mont Oct 13 '22 at 10:54
  • I posted an answer to that SO post. You may also check this SO guideline, [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers#:~:text=To%20mark%20an%20answer%20as,the%20answer%2C%20at%20any%20time.) – Anjela B Oct 17 '22 at 02:23

Do you need to transform the files, or simply copy them?

If you simply need to copy them, Storage Transfer Service lets you schedule incremental syncs from on-premises to Cloud Storage.
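
If you prefer to create the job programmatically rather than through the console, a rough sketch using the google-cloud-storage-transfer Python client could look like the following. The project ID, agent pool, source directory, and bucket name are placeholder assumptions, and an on-premises (POSIX) source also requires Transfer Service agents to be installed and running on your servers:

```python
from google.cloud import storage_transfer

def create_daily_onprem_to_gcs_job():
    """Create a Storage Transfer Service job that syncs an on-prem
    directory to a Cloud Storage bucket once a day (placeholder values)."""
    client = storage_transfer.StorageTransferServiceClient()

    transfer_job = {
        "project_id": "my-project",                 # placeholder
        "description": "Daily on-prem to GCS sync",
        "status": storage_transfer.TransferJob.Status.ENABLED,
        # Only a start date is given; with no repeat_interval the job
        # repeats on the default 24-hour schedule.
        "schedule": {
            "schedule_start_date": {"year": 2022, "month": 10, "day": 11},
        },
        "transfer_spec": {
            "source_agent_pool_name": "projects/my-project/agentPools/my-pool",  # placeholder
            "posix_data_source": {"root_directory": "/data/export"},             # on-prem dir
            "gcs_data_sink": {"bucket_name": "my-etl-bucket"},                   # placeholder
        },
    }

    result = client.create_transfer_job({"transfer_job": transfer_job})
    print(f"Created transfer job: {result.name}")

if __name__ == "__main__":
    create_daily_onprem_to_gcs_job()
```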

If you are looking for a simpler workflow, gsutil or the recent gcloud storage commands may offer a good alternative for running the copies (and perhaps scheduling them with crontab).

Take a look at this blog post, as it explores some alternatives.

Bruno Volpato
  • Thank you for the reply. I will probably need some transformation, which is why I was thinking about Dataflow. I see two options; which is better? 1. Copy the files into GCS with Storage Transfer Service, then read them from GCS with Dataflow. 2. Use Dataflow directly. But how? I cannot find any Python examples to start with. – alex-mont Oct 10 '22 at 15:29
  • 1
  • Probably makes sense to copy to Cloud Storage first, and then apply transformations. Dataflow runs workers on Google Cloud and wouldn't easily reach your on-premises files. Once the files are on Cloud Storage, you could try to leverage [Google-provided Templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates) or create your own Beam pipeline. The WordCount example is trivial, but it is a very good starting point: it reads data, transforms it, and writes to Cloud Storage. Take a look [here](https://beam.apache.org/get-started/wordcount-example/). – Bruno Volpato Oct 10 '22 at 15:52
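
Following up on the last comment: once the files are in Cloud Storage, a minimal Beam pipeline in the spirit of the WordCount example might look like the sketch below. It reads lines from files already copied into the bucket, applies a placeholder transform, and writes the result back to Cloud Storage; the bucket name, paths, and transform are assumptions for illustration only.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Without extra options this uses the local DirectRunner; pass
    # --runner=DataflowRunner, --project, --region and --temp_location
    # to run the same pipeline on Dataflow.
    options = PipelineOptions()

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Read every line from the files copied into the bucket.
            | "Read" >> beam.io.ReadFromText("gs://my-etl-bucket/incoming/*.csv")
            # Placeholder transform: upper-case each line.
            | "Transform" >> beam.Map(lambda line: line.upper())
            # Write the transformed lines back to Cloud Storage.
            | "Write" >> beam.io.WriteToText("gs://my-etl-bucket/output/result")
        )

if __name__ == "__main__":
    run()
```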