
I am working on a large download feature. The requirement is to read through 100k+ gzip-compressed JSON files on S3, filter them with S3 Select, and stream the filtered data to the client as a download.
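For context, the S3 Interactor issues roughly the following call per file (a simplified boto3 sketch; the bucket, key, and SQL expression are placeholders, and the files are assumed to be JSON lines):

```python
import boto3

s3 = boto3.client("s3")

def select_records(bucket: str, key: str):
    """Run S3 Select against one gzip JSON-lines file and yield filtered bytes."""
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT * FROM S3Object s",  # placeholder filter
        InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
        OutputSerialization={"JSON": {}},
    )
    # The Payload is an event stream; 'Records' events carry the result bytes.
    for event in response["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"]
```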

I have written two services:

  1. Client interaction (Controller)
  2. S3 interaction (S3 Interactor)

When the client clicks the download button, the controller calls the S3 Interactor for data, but after a few minutes the connection between the services breaks. I am not sure how to keep the connection alive for, say, 30 minutes, because the data can run into terabytes.
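What I am trying to achieve on the controller side is roughly the following (a sketch using Flask purely for illustration, reusing `select_records` from the sketch above; the bucket name is a placeholder). Streaming a chunked response like this should, in principle, keep the connection alive as long as bytes keep flowing:

```python
import boto3
from flask import Flask, Response, stream_with_context

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def list_keys(bucket: str):
    """Enumerate every key in the bucket (paginated, so 100k+ keys are fine)."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"]

@app.route("/download")
def download():
    def generate():
        # Chunked transfer encoding: the connection stays open while data flows.
        for key in list_keys(BUCKET):
            yield from select_records(BUCKET, key)

    return Response(
        stream_with_context(generate()),
        mimetype="application/json",
        headers={"Content-Disposition": 'attachment; filename="export.json"'},
    )
```

Even with streaming like this, the connection still breaks after a few minutes.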

Anonymous Coward
  • I had a similar problem where the connection would time out. I ended up grabbing byte ranges in consecutive requests instead of trying to dump the entire file – [this might help you](https://stackoverflow.com/questions/70625366/streaming-files-from-aws-s3-with-nodejs) (a sketch of this byte-range approach appears after these comments). – about14sheep Apr 09 '23 at 02:20
  • Are all the files in the same format? I wonder whether Amazon Athena would be a suitable alternative to S3 Select, since it can scan multiple files simultaneously and run SQL across them. However, 100k+ files might be too much for Athena. – John Rotenstein Apr 09 '23 at 08:31
  • @JohnRotenstein The files are all in the same format. I tried Athena; it works fine most of the time with 20k files but breaks at around 40k. That is why I went with the standard approach. – Taslim Arif Apr 10 '23 at 03:57
  • @about14sheep I think S3 Select does not support byte-range reads for gzip-compressed JSON data. – Taslim Arif Apr 10 '23 at 04:05
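The byte-range approach about14sheep describes would look roughly like this, using a plain `GetObject` rather than S3 Select (a minimal sketch; the 8 MB chunk size is an arbitrary choice). Each range request is short-lived, so no single connection has to stay open for the whole transfer; the caveat, as noted above, is that the ranges return raw gzip bytes rather than S3 Select-filtered records:

```python
import boto3

s3 = boto3.client("s3")

def stream_in_ranges(bucket: str, key: str, chunk_size: int = 8 * 1024 * 1024):
    """Fetch an object in consecutive byte ranges so no single request runs long."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    start = 0
    while start < size:
        end = min(start + chunk_size, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        yield resp["Body"].read()
        start = end + 1
```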
