21

Does anybody know how to grep S3 files with the AWS CLI, directly in the bucket? For example, I have FILE1.csv and FILE2.csv with many rows and want to find the rows that contain the string JZZ.

aws s3 ls --recursive s3://mybucket/loaded/*.csv.gz | grep 'JZZ'
Msordi

4 Answers

22

The aws s3 cp command can send output to stdout:

aws s3 cp s3://mybucket/foo.csv - | grep 'JZZ'

The dash (-) signals the command to send output to stdout.

See: How to use AWS S3 CLI to dump files to stdout in BASH?
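If the objects are gzip-compressed, as in the *.csv.gz pattern the question tried, a minimal variant of the same idea (the key s3://mybucket/loaded/FILE1.csv.gz is just an assumed example) is to decompress the stream before grepping:

aws s3 cp s3://mybucket/loaded/FILE1.csv.gz - | gunzip -c | grep 'JZZ'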

John Rotenstein
    Doesn't that incur data transfer charges for downloading the entire file? – Andrew Aug 24 '22 at 11:56
  • @Andrew Data Transfer is charged for _sending data from AWS to the Internet_. If this command was run on a computer on the Internet, the Data Transfer charges would apply. If it is run on an Amazon EC2 instance in the same Region as the Amazon S3 bucket, then Data Transfer would _not_ apply. – John Rotenstein Aug 24 '22 at 12:20
  • Thanks John. If you wanted to grep 10TB of compressed files in S3, would you just uncompress and grep on the EC2 instance? Any advantage to using a lambda function and run it as an S3 Batch Operation? – Andrew Aug 24 '22 at 12:50
  • @Andrew Please create a new Question rather than asking via a comment on an old question. Please add lots of details to your new Question. – John Rotenstein Aug 24 '22 at 12:56
12

You can also use the Glue/Athena combination, which lets you run the search directly within AWS. Depending on data volumes, the queries can be costly and take time.

Basically

  • Create a Glue classifier that reads the files line by line, so each row lands in a single column (line)
  • Create a crawler over your S3 data directory against a database (csvdumpdb); it will create a table with all the lines across all the CSVs found
  • Use Athena to query it (a CLI sketch follows at the end of this answer), e.g.

    select "$path",line from where line like '%some%fancy%string%'

  • and get something like

    $path line

    s3://mybucket/mydir/my.csv "some I did find some,yes, "fancy, yes, string"

Saves you from having to run any external infrastructure.
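If you prefer to drive the Athena step from the command line as well, here is a minimal sketch using the AWS CLI; the table name (my_csv_lines) and the results location (s3://mybucket/athena-results/) are assumptions to replace with your own:

# Submit the query (assumed table name and results bucket)
aws athena start-query-execution \
    --query-string "SELECT \"\$path\", line FROM my_csv_lines WHERE line LIKE '%JZZ%'" \
    --query-execution-context Database=csvdumpdb \
    --result-configuration OutputLocation=s3://mybucket/athena-results/

# Fetch the matching rows once the query finishes, using the execution id returned above
aws athena get-query-results --query-execution-id <query-execution-id>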

11

You can do it locally with the following command:

aws s3 ls --recursive s3://<bucket_name>/<path>/ | awk '{print $4}' | xargs -I FNAME sh -c "echo FNAME; aws s3 cp s3://<bucket_name>/FNAME - | grep --color=always '<regex_pattern>'"

Explanation: the ls command generates the list of files; then we select the file name from the output and, for each file (the xargs command), download it from S3 and grep its contents.

I don't recommend this approach if you have to download a lot of data from S3 (due to transfer costs). You can avoid the internet transfer costs, though, by running the command on an EC2 instance located in a VPC with an S3 VPC endpoint attached to it.
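A variant of the same loop for the compressed *.csv.gz layout from the question: filter the listing to the compressed keys, decompress each stream, and let xargs run a few downloads in parallel (-P 4). It assumes GNU xargs and GNU grep (for --label), so treat it as a sketch:

aws s3 ls --recursive s3://<bucket_name>/<path>/ | awk '{print $4}' | grep '\.csv\.gz$' | xargs -P 4 -I FNAME sh -c "aws s3 cp s3://<bucket_name>/FNAME - | gunzip -c | grep --color=always -H --label=FNAME '<regex_pattern>'"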

Eugen
0

There is a way to do it through the AWS command line, but it requires some extra tools and fancy pipes. Here are some examples:

S3:

aws s3api list-objects --bucket my-logging-bucket --prefix "s3/my-events-2022-01-01" | jq -r '.Contents[]| .Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} -

Cloudfront:

aws s3api list-objects --bucket my-logging-bucket --prefix "cloudfront/blog.example.com/EEQEEEEEEEEE.2022-01-01" |jq -r '.Contents[]| .Key' | sort -r | xargs -I{} aws s3 cp s3://my-logging-bucket/{} - | zgrep GET

The "sort -r" just reverses the order so it shows the newest objects first. You can omit that if you want to look at them in chronological order.

Deepesh