5

The AWS documentation mentions: "The maximum length of a record in the input or result is 1 MB." https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html

However, I'm able to fetch a 2.4 GB result by running an S3 Select query through a Python Lambda, and I've seen people working with even larger result sizes.

Can someone please explain the significance of the 1 MB limit mentioned in the AWS documentation, and what it actually applies to?

Aman Gupta
  • 59
  • 2
  • 4

2 Answers

2

Background:

I recently faced the same question regarding the 1 MB limit. I'm dealing with a large gzip-compressed CSV file and had to figure out whether S3 Select would be an alternative to processing the file myself. My research leads me to believe that the author of the other answer misunderstood the question.

The 1 MB limit referenced by the current AWS S3 Select documentation refers to the record size:

... The maximum length of a record in the input or result is 1 MB.

The SQL query is not the input in question (although it does have its own, lower limit):

... The maximum length of a SQL expression is 256 KB.

Question Response:

I interpret this 1 MB limit the following way:

  1. One row of the queried CSV file (the uncompressed input) can't be larger than 1 MB
  2. One result record (a result row returned by S3 Select) also can't be larger than 1 MB

To put this into a practical perspective, consider the string size in bytes in Python, assuming UTF-8 encoding:

  1. This means `len(row.encode('utf-8'))` (the string size in bytes) must be <= 1024 * 1024 bytes for each CSV row of the input file, represented as a UTF-8 encoded string.
  2. Likewise, `len(response_json.encode('utf-8'))` must be <= 1024 * 1024 bytes for each returned response record (in my case, a JSON result).
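The two checks above can be sketched in Python (the helper name is illustrative, not part of any AWS SDK):

```python
# Illustrative helper: check whether a single row/record stays within
# S3 Select's 1 MB per-record limit when UTF-8 encoded.
MAX_RECORD_BYTES = 1024 * 1024  # 1 MB

def fits_record_limit(row: str) -> bool:
    """Return True if the UTF-8 encoded row is at most 1 MB."""
    return len(row.encode('utf-8')) <= MAX_RECORD_BYTES

print(fits_record_limit("a,b,c"))                        # True: tiny CSV row
print(fits_record_limit("x" * (MAX_RECORD_BYTES + 1)))   # False: oversized row
```

Note that the limit is on the encoded byte length, not the character count, so multi-byte characters reduce the number of characters that fit in one record.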

Note:

In my case, the 1 MB limit works fine. However, this depends a lot on the amount of data in your input (and potentially extra, static columns you might add via SQL).

If the 1 MB limit is exceeded and you want to query the files without involving a database, the more expensive AWS Athena might be an option.
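This also resolves the apparent contradiction in the question: the limit is per record, not per total result. S3 Select streams the result back as a sequence of events, so the overall result (e.g. 2.4 GB) can be much larger than 1 MB as long as no single record exceeds it. A minimal sketch using boto3's `select_object_content` (the bucket, key, and query are placeholders, and the CSV/GZIP serialization settings match my use case, not necessarily yours):

```python
# Sketch: stream an S3 Select result with boto3 (the AWS SDK for Python,
# preinstalled in the Lambda runtime). Not a drop-in solution; bucket,
# key, and expression are placeholders you must supply.
def run_s3_select(bucket: str, key: str, expression: str):
    import boto3  # imported here so the sketch stays self-contained
    s3 = boto3.client('s3')
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=expression,          # e.g. "SELECT * FROM s3object s"
        ExpressionType='SQL',
        InputSerialization={'CSV': {'FileHeaderInfo': 'USE'},
                            'CompressionType': 'GZIP'},
        OutputSerialization={'JSON': {}},
    )
    # The result arrives as an event stream: many 'Records' events,
    # each carrying payload bytes. Individual records are capped at
    # 1 MB, but the stream as a whole is not.
    for event in response['Payload']:
        if 'Records' in event:
            yield event['Records']['Payload']
```

A caller would simply concatenate or process the yielded chunks one by one, which is why multi-gigabyte totals are possible within the per-record limit.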

kf06925
  • 371
  • 3
  • 5
1

Could you point us to the part of the documentation that talks about this 1 MB limit?

I have never seen a 1 MB limit. Downloading an object is just downloading, and you can download files of almost unlimited size.

AWS uploads files with multipart upload, which has limits of up to terabytes per object and up to gigabytes per object part.


The docs are here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html


Response to the question

As per the author's comment below my post:

The limit is described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/querying-glacier-archives.html

These docs refer to querying archived objects, i.e. running a query on the data without first retrieving it from Glacier.

The input query cannot exceed 1 MB, and the output of that query cannot exceed 1 MB either.

  • Input is the SQL query
  • Output is the list of files

Find more info here: https://docs.aws.amazon.com/amazonglacier/latest/dev/s3-glacier-select-sql-reference-select.html

So this limit applies not to the files themselves but to the SQL-like queries.

Daniel Hornik
  • 1,957
  • 1
  • 14
  • 33
  • The limit is for S3 select as written [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/querying-glacier-archives.html). Maybe you could expand on the limit? – Marcin Jan 13 '21 at 22:35
  • 1
    This docs refers to query for archived objects. So you can do some query on data, without collecting it from the `Glacier`. And input query cannot exceed 1MB. Output of that query cannot exceed 1MB. `Input` is SQL query, `Output` is files list. Find more info here: https://docs.aws.amazon.com/amazonglacier/latest/dev/s3-glacier-select-sql-reference-select.html – Daniel Hornik Jan 13 '21 at 22:45
  • 1
    Here's an AWS documentation which mentions this limit for using S3 Select query: https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html Adding to what you answered, **Input query cannot exceed 1MB. Output of that query cannot exceed 1MB. Input is SQL query, Output is files list** This is exactly what my question is about. I'm able to get an output of 2.4GB upon running an input S3 Select SQL query, but as you and the doc mentioned, we shouldn't be able to get more than 1MB of output. Can you please elaborate on this? – Aman Gupta Jan 14 '21 at 11:37
  • Can you show us how you measured 2.4 GB? If we assume one path is 60 bytes long, that is around 400,000,000 files as a result. How do you measure your output? Could you describe step by step what you are doing? – Daniel Hornik Jan 14 '21 at 11:43
  • I uploaded a 2.4 GB JSON file containing 5 million lines, each line containing a JSON object. Then, from my Lambda, I did a `select *` on this object and got the result. – Aman Gupta Jan 18 '21 at 10:36