Amazon Athena can run SQL-like queries across multiple files stored in Amazon S3.
The files can be compressed with gzip. In fact, Athena will usually run faster and cheaper on compressed files, because you are charged only for the amount of data scanned and compressed files are smaller.
All files in a given folder (path) in Amazon S3 must be in the same format. For example, if they are CSV files in gzip format, all the files must have the same number of columns in the same order.
You can then use CREATE TABLE in Amazon Athena, which defines the columns in the data files and the location of the data. This is the hardest part, because you have to get the format correctly defined.
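As a rough sketch of what that definition might look like for gzip-compressed CSV files, assuming each file holds only the two columns mentioned in the question, in that order, plus a header row. The table name my_table, the bucket path s3://my-bucket/sales/ and the string column types are placeholders of mine, so adjust them to match your actual files:

```sql
-- Hypothetical table over gzip-compressed CSV files (names and types are assumptions)
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
  customer_id string,
  item_id     string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/sales/'                  -- the folder, not an individual file
TBLPROPERTIES ('skip.header.line.count' = '1');   -- drop this line if the files have no header row
```

Athena recognizes gzip from the .gz file extension, so the compression does not need to be declared in the table definition. If the files contain more columns, list all of them in the order they appear in the files.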
Then, you can run SQL SELECT commands to query the data, which will apply to all files in the designated folder.
In future, if you want to add or remove data, simply update the contents of the folder. The SELECT command always looks at the files in the folder at the time that the command is run.
Given your requirement of "count distinct values of a customer_id and group them by item_id across all files", it would be something like:
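SELECT
  item_id,
  COUNT(DISTINCT customer_id) AS distinct_customers
FROM my_table  -- placeholder: use your own table name (TABLE itself is a reserved word)
GROUP BY item_id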
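To see what this kind of query returns without touching any files, the same aggregation can be tried on a few inline rows. The values below are made up purely for illustration:

```sql
-- Made-up sample rows: customer c1 bought item i1 twice, c2 bought i1 once, c1 bought i2 once
SELECT
  item_id,
  COUNT(DISTINCT customer_id) AS distinct_customers
FROM (
  VALUES ('c1', 'i1'), ('c1', 'i1'), ('c2', 'i1'), ('c1', 'i2')
) AS t (customer_id, item_id)
GROUP BY item_id;
-- i1 -> 2 distinct customers, i2 -> 1
```

The repeated ('c1', 'i1') row is counted only once for i1, which is the effect of DISTINCT.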