
We have a requirement to get .csv files from an S3 bucket at a client location (they would provide the bucket details and any other information required). Every day we need to pull this data into our own S3 bucket so we can process it further. Please suggest the best way/technology we can use to achieve this.

I am planning to do it with Python boto (or Pandas or PySpark) or Spark, the reason being that once we get this data it might be processed further.
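Here is a minimal sketch of what I have in mind with boto3, assuming the client's bucket policy grants our account read access; the bucket names and prefix below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# List the day's objects in the client's bucket (names and prefix are placeholders)
resp = s3.list_objects_v2(Bucket="client-bucket", Prefix="daily/")

for obj in resp.get("Contents", []):
    key = obj["Key"]
    if key.endswith(".csv"):
        # Server-side copy: the data moves within S3 rather than through this machine
        s3.copy_object(
            Bucket="my-bucket",
            Key=f"incoming/{key}",
            CopySource={"Bucket": "client-bucket", "Key": key},
        )
```

(A real job would also need pagination over the listing and a daily schedule, e.g. cron or a scheduled Lambda.)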

Raghunandan Sk
  • Possible duplicate of [AWS S3 copy files and folders between two buckets](https://stackoverflow.com/questions/9295587/aws-s3-copy-files-and-folders-between-two-buckets) – Chacko Jan 05 '18 at 06:22

2 Answers


You can try a cross-account object copy using the S3 COPY operation. This is the more secure and recommended approach, and it also works for different buckets within the same account. Please go through the link below for more details. After the copy, you can trigger a Lambda function with custom code (Python) to process the .csv files.

How to copy Amazon S3 objects from one AWS account to another by using the S3 COPY operation
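As a rough illustration of that last step, a Lambda handler along these lines could be attached to an S3 put event on your bucket; the processing itself is only a placeholder:

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # One record per object that triggered the S3 event notification
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Fetch the newly copied .csv and parse it
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))

        # Placeholder: real processing of the rows goes here
        print(f"Processed {key}: {len(rows)} rows")
```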

Vaisakh PS
  • Thanks for the reply. I was looking for an option to set up an ETL each day to get this data. Could that also be achieved using this method? – Raghunandan Sk Jan 03 '18 at 04:48
  • This will do the Extract; the Transform and Load you can take care of using a Lambda function or an EC2 server. – Vaisakh PS Jan 03 '18 at 05:23

If your customer keeps the data in an S3 bucket to which your account has been granted access, then it should be possible to use the .csv files as a direct source of data for a Spark job. Use s3a://theirbucket/nightly/*.csv as the source, and save it to s3a://mybucket/somewhere, ideally in a format other than CSV (Parquet, ORC, ...). This lets you do some basic transformation of the data into a format that is easier to work with.
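A minimal sketch of that read-and-convert step, using the DataFrame API rather than the raw RDD source (the session settings and CSV options are assumptions to adjust):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-csv-ingest").getOrCreate()

# Read the client's nightly CSV drop straight from their bucket
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3a://theirbucket/nightly/*.csv"))

# Basic transformations of the data would go here

# Write to your own bucket in a columnar format
(df.write
   .mode("overwrite")
   .parquet("s3a://mybucket/somewhere/"))
```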

If you just want the raw CSV files, that S3 COPY operation is what you need: it copies the data within S3 itself (6+ MiB/s if within the same S3 location) and doesn't involve any of your own VMs.

stevel