Background:
I recently faced the same question regarding the 1 MB limit. I'm dealing with a large gzip-compressed CSV file and had to figure out whether S3 Select would be an alternative to processing the file myself. My research suggests that the author of the previous answer misunderstood the question.
The 1 MB limit referenced by the current AWS S3 Select documentation is referring to the record size:
... The maximum length of a record in the input or result is 1 MB.
The SQL query is not the input (it has its own, lower limit):
... The maximum length of a SQL expression is 256 KB.
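For context, here is a minimal sketch of how such a query might be issued with boto3 (bucket, key, and query are placeholders of my own choosing, not from the original question); the Expression string is subject to the 256 KB limit, while each input and result record is subject to the 1 MB limit:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; the gzip-compressed CSV is queried in place.
response = s3.select_object_content(
    Bucket="my-bucket",                       # placeholder
    Key="data/large-file.csv.gz",             # placeholder
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s LIMIT 100",  # counts toward the 256 KB expression limit
    InputSerialization={
        "CSV": {"FileHeaderInfo": "USE"},
        "CompressionType": "GZIP",
    },
    OutputSerialization={"JSON": {}},         # each returned record must stay under 1 MB
)

# The payload is an event stream; 'Records' events carry the result bytes.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```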
Answer:
I interpret this 1 MB limit the following way:
- One row of the queried CSV file (the uncompressed input) can't be larger than 1 MB
- One result record (a result row returned by S3 Select) also can't be larger than 1 MB
To put this in a practical perspective: the size of a Python string in bytes depends on its encoding; I'm using a UTF-8 encoding.
- This means `len(row.encode('utf-8'))` (the string size in bytes) must be <= 1024 * 1024 bytes for each CSV row of the input file, represented as a UTF-8 encoded string.
- Likewise, `len(response_json.encode('utf-8'))` must be <= 1024 * 1024 bytes for each returned response record (in my case, a JSON result). See the sketch below.
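As a quick sanity check before committing to S3 Select, something like the following sketch can verify that no row of a local copy of the gzip-compressed CSV exceeds the input record limit (the file name and UTF-8 encoding are my assumptions):

```python
import gzip

MAX_RECORD_BYTES = 1024 * 1024  # S3 Select limit per input/result record

# Hypothetical local copy of the gzip-compressed CSV input.
with gzip.open("large-file.csv.gz", "rt", encoding="utf-8") as f:
    oversized = [
        line_no
        for line_no, row in enumerate(f, start=1)
        if len(row.encode("utf-8")) > MAX_RECORD_BYTES
    ]

if oversized:
    print(f"{len(oversized)} rows exceed the 1 MB record limit, e.g. line {oversized[0]}")
else:
    print("All rows fit within the 1 MB input record limit.")
```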
Note:
In my case, the 1 MB limit is not a problem. However, this depends heavily on how much data each row of your input contains (and on any extra, static columns you might add via SQL).
If the 1 MB limit is exceeded and you still want to query the files without involving a database, the more expensive AWS Athena might be a solution.