
I'm looking at creating a pipeline between an FTP server and an AWS S3 bucket. With the capability of monitoring the FTP server for new files. I would like to program most to all of it with Python. Where do I begin?

  • Learn `boto3`. You're asking a really broad question here, but if you start by familiarizing yourself with `boto3` to interact with S3 you'll be on a good path to getting started. – vielkind Aug 22 '18 at 20:13
  • Your question doesn't fit here particularly well, but is a worthy question. Please check ["Which site?"](https://meta.stackexchange.com/questions/129598/which-computer-science-programming-stack-exchange-do-i-post-in). This is in the realm of tutorial guidance, which is too broad for Stack Overflow. – Prune Aug 22 '18 at 20:16
  • In https://github.com/aws-samples/data-pipeline-samples/blob/master/samples/ShellCommandWithFTP/pipeline.json there is an example that uses `ShellCommandActivity` for the FTP part – ralf htp Aug 22 '18 at 21:17
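A minimal sketch of the kind of `boto3`/S3 interaction the first comment is pointing at (the bucket and key names are placeholders, not values from the question):

```python
# Minimal boto3/S3 sketch: upload a file and list a prefix.
# "my-target-bucket" and the key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Upload a local file to S3
s3.upload_file("local_report.csv", "my-target-bucket", "incoming/report.csv")

# List what is already in the bucket under a prefix
response = s3.list_objects_v2(Bucket="my-target-bucket", Prefix="incoming/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```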

1 Answer


First, try to configure a Data Pipeline manually in the AWS console; the service is still a bit buggy, so you want to start with the easiest approach:

  1. Sign in to the AWS Console and go to the Data Pipeline service.
  2. Hit the Get Started button, which should open the Create Pipeline menu.
  3. Name the pipeline, choose "Build using Architect" in the Source field, and choose "on pipeline activation" in the Schedule section.
  4. (Optional) I strongly suggest keeping Logging enabled and providing an S3 bucket for your pipeline logs, so you can troubleshoot later.
  5. Hit the Activate button, which should open the visual (Architect) editor for your pipeline.
  6. On the diagram, hit the Add button and add a ShellCommandActivity; this is where you'll specify the shell command that reads the file/folder from your FTP server. Because Data Pipeline doesn't support FTP as a data source, you have to do it via this shell activity.
  7. In the right-hand menu, click "Add an optional field..." and add a Command field, where you'll put the command that reads from FTP (it can also just invoke a small Python script; see the sketch after this list).
  8. Hit "Add an optional field..." again and choose Output, which will create a DataNode.
  9. Click on that DataNode box and, via "Add an optional field..." in the right-hand menu, add Directory Path, where you'll specify your target S3 bucket.
  10. Save and Activate the pipeline.
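
For the command in step 7, since you'd like to do as much as possible in Python, the shell command can simply run a small Python script instead of something like wget/curl. A minimal sketch of such a script using `ftplib` and `boto3` (the host, credentials, bucket and prefix are placeholders, and it assumes the FTP directory contains only regular files):

```python
# ftp_to_s3.py - a minimal sketch of a script the ShellCommandActivity could run.
# Host, credentials, bucket and prefixes below are placeholders, not real values.
import io
from ftplib import FTP

import boto3

FTP_HOST = "ftp.example.com"
FTP_USER = "user"
FTP_PASS = "password"
FTP_DIR = "/incoming"
S3_BUCKET = "my-target-bucket"
S3_PREFIX = "ftp-dump/"

s3 = boto3.client("s3")

with FTP(FTP_HOST) as ftp:
    ftp.login(FTP_USER, FTP_PASS)
    ftp.cwd(FTP_DIR)
    for name in ftp.nlst():                              # assumes files only, no subdirectories
        buf = io.BytesIO()
        ftp.retrbinary(f"RETR {name}", buf.write)         # download one file into memory
        buf.seek(0)
        s3.upload_fileobj(buf, S3_BUCKET, S3_PREFIX + name)  # push it to S3
        print(f"copied {name} -> s3://{S3_BUCKET}/{S3_PREFIX}{name}")
```

Detecting which files are new (your monitoring requirement) isn't shown here; one simple approach is to compare the FTP listing against what's already under the S3 prefix before copying.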

After you've done this and have a working pipeline, you can bring in Python. It isn't clear yet how you plan to use Python, but if you want to do the Data Pipeline configuration from a Python script, check out the Boto3 documentation for Data Pipeline.
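
As a rough illustration of that route (not a drop-in definition: the role names, bucket paths, FTP command and region below are all placeholders, and the exact objects/fields will depend on your pipeline), the same console steps could look roughly like this with the boto3 `datapipeline` client:

```python
# Rough sketch of creating, defining and activating the pipeline from Python.
# All names, roles, buckets and the FTP command are placeholder assumptions.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")  # assumed region

created = dp.create_pipeline(name="ftp-to-s3", uniqueId="ftp-to-s3-v1")
pipeline_id = created["pipelineId"]

pipeline_objects = [
    {   # Default object: on-demand schedule and logging, mirroring steps 3-4 above
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "pipelineLogUri", "stringValue": "s3://my-log-bucket/datapipeline-logs/"},
        ],
    },
    {   # EC2 instance that will run the shell command
        "id": "MyEc2Resource",
        "name": "MyEc2Resource",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "terminateAfter", "stringValue": "30 Minutes"},
        ],
    },
    {   # The ShellCommandActivity from steps 6-7: pull from FTP into the staging dir
        "id": "FtpFetch",
        "name": "FtpFetch",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "wget -r ftp://user:pass@ftp.example.com/incoming/ -P ${OUTPUT1_STAGING_DIR}"},
            {"key": "stage", "stringValue": "true"},
            {"key": "output", "refValue": "S3Target"},
            {"key": "runsOn", "refValue": "MyEc2Resource"},
        ],
    },
    {   # The output DataNode from steps 8-9: your target S3 location
        "id": "S3Target",
        "name": "S3Target",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-target-bucket/ftp-dump/"},
        ],
    },
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```

The ShellCommandWithFTP sample linked in the comments is a good reference for what a complete definition of this shape looks like.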

Victor