Can someone please let me know the most efficient way of pulling data from S3? Basically I want to pull out data for a given time range, apply some filters over the data (JSON), and store it in a DB. I am new to AWS and after a little research found that I can do it via the boto3 API, Athena queries, or the AWS CLI, but I need some advice on which one to go with.
- One file or lots of files? How big are the files (and how many rows)? How often will you be doing it? – John Rotenstein Nov 26 '18 at 18:08
- @JohnRotenstein The folders are named date-wise, with each directory holding around 10 compressed files. When extracted they come to around 300 MB each (~2 lakh, i.e. ~200k, records). At least for now I am thinking of pulling it once a day, maybe. – chidori Nov 27 '18 at 10:51
2 Answers
If you are looking for the simplest and most straightforward solution, I would recommend the AWS CLI. It's perfect for running commands to download a file, list a bucket, etc., from the command line or a shell script.
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature-rich IMO and much cleaner than running shell commands from within your application.
If the application that is pulling the data is written in Python, then I definitely recommend boto3. Make sure to read up on the difference between a boto3 client and a resource.
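For example, here is a minimal boto3 sketch of the list-download-filter flow using a client. The bucket name, date-wise prefix, gzipped JSON-lines layout, and the `status` filter are all assumptions you would replace with your own:

```python
import gzip
import json

import boto3

BUCKET = "my-data-bucket"   # assumption: your bucket name
PREFIX = "2018/11/26/"      # assumption: the date-wise folder you want to pull

s3 = boto3.client("s3")

# Paginate through every object under the day's prefix.
paginator = s3.get_paginator("list_objects_v2")
matching = []

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        # Assumption: each file is gzip-compressed, one JSON record per line.
        for line in gzip.decompress(body).decode("utf-8").splitlines():
            record = json.loads(line)
            if record.get("status") == "active":   # hypothetical filter
                matching.append(record)

# `matching` can now be written to your database of choice.
```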

Some options:
- Download and process: Launch a temporary EC2 instance, have a script download the files of interest (e.g. one day's files), and use a Python program to process the data. This gives you full control over what is happening.
- Amazon S3 Select: This is a simple way to extract data from CSV or JSON files, but it only operates on a single file at a time (a boto3 sketch follows this list).
- Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).
- Amazon EMR: Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
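For the S3 Select option, here is a rough boto3 sketch against a single gzipped JSON-lines object; the bucket, key, and the `status` field used in the SQL expression are assumptions:

```python
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="my-data-bucket",                    # assumption: your bucket
    Key="2018/11/26/part-0001.json.gz",         # assumption: one gzipped JSON-lines file
    ExpressionType="SQL",
    Expression="SELECT s.* FROM s3object s WHERE s.status = 'active'",  # hypothetical filter
    InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# The result comes back as an event stream; Records events hold the matching rows.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```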
Based on your description (10 files, 300 MB, ~200k records) I would recommend starting with Amazon Athena, since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes testing faster) and, once you have the desired results, run them across all the data files.
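If you go with Athena, queries can also be run programmatically from Python. A minimal sketch, assuming a table named `events` has already been defined over the S3 data (e.g. via CREATE EXTERNAL TABLE or a Glue crawler) and a writable S3 location for query results:

```python
import time

import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT * FROM events WHERE status = 'active' LIMIT 100",  # hypothetical table/filter
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},     # assumption: writable bucket
)["QueryExecutionId"]

# Simple polling loop; production code should add a timeout and back-off.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```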
