I need to list the files inside a tar archive stored in an S3 bucket using the AWS CLI, without downloading and extracting the archive first. I only need the file list. I tried the select-object-content S3 API command, but it throws seemingly random errors.

The command I tried is

aws s3api select-object-content --bucket my-temp-files --key S3_temp_compression_test/20230216.tar --expression "select s from S3Object s where s.key like '%.tar'" --expression-type "SQL" --input-serialization '{"CSV": {"FileHeaderInfo": "Use"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' | tar -xOf - | tr ' ' '\n'

Are there any other approaches I could take?

John Rotenstein

1 Answer

SelectObjectContent only understands objects in CSV, JSON, or Apache Parquet format, optionally compressed with GZIP or BZIP2 (CSV and JSON) or with GZIP or Snappy columnar compression (Parquet). It does not support .tar(.gz), so it is simply not an option here. Even if it did, it would return the contents of the files, not the list of files. It is entirely the wrong tool for the job.

What you could theoretically do is use byte-range fetches to retrieve only the file headers from the tar file. But if the tar is also gzipped, that is not really an option either; see https://unix.stackexchange.com/a/117356/175925 .
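
For an uncompressed .tar, the byte-range idea could look roughly like the sketch below. It is only an illustration, not a tested production tool: `list_tar_names`, `fetch_range`, and `s3_fetcher` are hypothetical names, the S3 part assumes boto3 is installed and credentials are configured, and GNU/pax long-name records are skipped, so very long paths may come back truncated. It relies on the tar layout of 512-byte headers followed by data padded to a multiple of 512 bytes.

```python
import tarfile

BLOCK = 512  # tar headers and data padding are both 512-byte aligned

# Records that carry a payload but are metadata rather than archive members.
META_TYPES = (tarfile.GNUTYPE_LONGNAME, tarfile.GNUTYPE_LONGLINK,
              tarfile.XHDTYPE, tarfile.XGLTYPE, tarfile.SOLARIS_XHDTYPE)


def list_tar_names(fetch_range):
    """List member names by reading only the 512-byte headers.

    fetch_range(start, length) must return up to `length` bytes starting at
    byte offset `start` (hypothetical callback, e.g. an S3 ranged GET).
    """
    names, offset = [], 0
    while True:
        block = fetch_range(offset, BLOCK)
        # A short read or an all-zero block marks the end of the archive.
        if len(block) < BLOCK or block == b"\0" * BLOCK:
            break
        info = tarfile.TarInfo.frombuf(block, "utf-8", "surrogateescape")
        offset += BLOCK
        if info.isreg() or info.type in META_TYPES:
            # Skip the payload, rounded up to the next 512-byte boundary.
            offset += (info.size + BLOCK - 1) // BLOCK * BLOCK
        if info.type not in META_TYPES:
            names.append(info.name)
    return names


def s3_fetcher(bucket, key):
    """Build a fetch_range callback backed by S3 ranged GETs (needs boto3)."""
    import boto3  # imported lazily so the parser above has no AWS dependency

    s3 = boto3.client("s3")

    def fetch_range(start, length):
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={start}-{start + length - 1}")
        return resp["Body"].read()

    return fetch_range
```

Usage would be something like `list_tar_names(s3_fetcher("my-temp-files", "S3_temp_compression_test/20230216.tar"))`. Note that this issues one request per archive member, so for archives with many small files it may well be slower than simply streaming the whole object.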

You need to either download the entire file to inspect its contents (see https://stackoverflow.com/a/56086961/2442804) or store the list of files separately somewhere to begin with.
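
If the concern is mainly about writing the archive to disk, one possible variant of the download option is to stream the object and read only the member headers as they pass by. The sketch below assumes boto3 and configured credentials; `list_members_streaming` and `list_s3_tar` are hypothetical helper names. The whole object is still transferred over the network, it is just never stored locally.

```python
import tarfile


def list_members_streaming(fileobj):
    """Return member names from a tar stream, reading it sequentially.

    mode="r|*" treats fileobj as a non-seekable stream and auto-detects
    gzip/bz2/xz compression, so nothing is written to disk.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tf:
        return [member.name for member in tf]


def list_s3_tar(bucket, key):
    """Stream an S3 object through the lister (needs boto3 and credentials)."""
    import boto3  # imported lazily so the lister itself has no AWS dependency

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return list_members_streaming(body)
```

Because the stream mode also handles compressed archives, the same function should work for both .tar and .tar.gz objects.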

luk2302