
I need a system to read an S3 bucket for analysis. The bucket is broken down into Year/Month/Day/Hour folders, where each Hour folder holds a lot of zipped files that amount to over 2 GB.

Is this something I should be scripting in Python with boto3? I'm looking for any general direction.

GBL_DO
  • You should visit this and look at the third most upvoted answer: https://stackoverflow.com/questions/30249069/listing-contents-of-a-bucket-with-boto3 – user4700847 Jun 12 '19 at 23:08
  • What is your _actual_ requirement? That is, what type of "analysis" do you wish to perform on the files? What is the format of the files inside the zip files? I ask this because Amazon Athena could be a very easy way to process the files, but it depends on the format. Feel free to edit your question to provide more information. – John Rotenstein Jun 13 '19 at 00:16
  • @JohnRotenstein Athena looks like a great solution. The files in each folder are in gzip format. Will this be an issue? – GBL_DO Jun 13 '19 at 06:11
  • Athena loves reading gzip files. It makes things faster and cheaper! Please answer my above questions and we will be able to assist further. – John Rotenstein Jun 13 '19 at 07:12
  • @JohnRotenstein Awesome to hear that. Requirement is to count distinct values of a customer_id and group them by item_id across all files. – GBL_DO Jun 13 '19 at 19:04

1 Answer


Amazon Athena can run SQL-like queries across multiple files stored in Amazon S3.

The files can be compressed with gzip. In fact, Athena will run faster and cheaper on compressed files, because you are charged only for the amount of data scanned.

All files in a given folder (path) in Amazon S3 must be in the same format. For example, if they are CSV files in gzip format, all the files must have the same number of columns in the same order.

You can then use CREATE TABLE in Amazon Athena to define the columns in the data files and the location of the data. This is the hardest part, because the format has to be defined correctly.
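
As a rough sketch, assuming gzipped CSV files containing the two columns mentioned in the comments, the table definition might look like this (the table name transactions, the column list, and the bucket path are hypothetical placeholders; adjust them to match the actual files):

CREATE EXTERNAL TABLE transactions (
  customer_id string,
  item_id string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/2019/'
-- Athena decompresses the .gz files automatically, based on their file extension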

Then, you can run SQL SELECT commands to query the data, which will apply to all files in the designated folder.

In the future, if you want to add or remove data, simply update the contents of the folder. The SELECT command always looks at the files in the folder at the time the command is run.
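
Optionally, since the bucket is organized as Year/Month/Day/Hour, the table could instead be defined with partitions on those keys so that queries scan only the folders they need. A sketch using the same hypothetical names as above:

CREATE EXTERNAL TABLE transactions (
  customer_id string,
  item_id string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/'

-- Folders named 2019/06/13/00 (rather than year=2019/month=06/...) are not
-- auto-discovered, so each partition must be registered explicitly:
ALTER TABLE transactions ADD PARTITION (year='2019', month='06', day='13', hour='00')
LOCATION 's3://my-bucket/2019/06/13/00/'

Partitioning is not required for the query to work; it just reduces the amount of data scanned (and therefore cost) when you filter on those columns.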

Given your requirement of "count distinct values of a customer_id and group them by item_id across all files", it would be something like:

SELECT
  item_id,
  COUNT(DISTINCT customer_id)
FROM table
GROUP BY 1
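
(Here table is a placeholder for whatever name was used in CREATE TABLE, such as the hypothetical transactions above, and GROUP BY 1 groups by the first selected column, item_id.)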
John Rotenstein