
I am confused about S3 Select pricing with regard to data returned and data scanned. If I want to access something at an index in a JSON file, does S3 Select still scan the entire file, so that the data scanned counts as the entire file size? Suppose I use the following query on this example file:

select * from S3Object[*].place1[*].Houses[*]

{
    "place1": [
        {
            "Houses": [
                {
                    "date": "1777-06-30",
                    "price": "445000.0"
                },
                {
                    "date": "2014-10-31",
                    "price": "495000.0"
                }
            ],
            "Apartments": [
                {
                    "date": "1777-06-30",
                    "price": "445000.0"
                },
                {
                    "date": "2014-10-31",
                    "price": "495000.0"
                }
            ]
        }
    ]
}


Would I be charged for scanning the entire file, or would the data scanned be reduced because I am accessing the Houses array directly?

John Rotenstein
Nock
  • How is your data stored? Is it in a CSV file or a columnar format (eg Parquet, ORC)? Is it compressed? See also: [amazon web services - How S3 select pricing works? What is data returned and scanned in s3 select means - Stack Overflow](https://stackoverflow.com/questions/53001443/how-s3-select-pricing-works-what-is-data-returned-and-scanned-in-s3-select-mean) – John Rotenstein Dec 26 '20 at 21:05
  • @JohnRotenstein It is stored in an actual .json file. I understand that CSV files would need to be searched in full; however, I thought you could access JSON data directly at the index you specify and disregard the rest. Does this mean it only scans as much data as the output in this case? – Nock Dec 26 '20 at 22:37

1 Answer


JSON data would need to be scanned in its entirety to provide the output. This is because there is no concept of an index or a block range on a JSON file. (An index points to where data is stored, and a block range tracks the min/max value of data in a storage block.)
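
To make the pricing consequence concrete, here is a minimal sketch of the arithmetic. The function name, the bytes-per-GB convention, and the per-GB rates are my own assumptions for illustration, not quoted prices, so substitute the current rates for your Region from the S3 pricing page. It also ignores per-request charges. The point is that the scanned term always uses the whole file size, while only the returned term shrinks when the query is narrower.

```python
# Assumption: 1 GB is billed as 2**30 bytes; adjust if your bill uses another convention.
BYTES_PER_GB = 1024 ** 3


def estimate_select_cost(file_bytes, returned_bytes, scan_rate_per_gb, return_rate_per_gb):
    """Estimate the S3 Select data charge for one query on an uncompressed JSON file.

    Assumes the whole object is scanned, as described above, so the scan
    charge depends on file_bytes and not on how much the query returns.
    The rates are caller-supplied; none are built in.
    """
    scanned_cost = file_bytes / BYTES_PER_GB * scan_rate_per_gb
    returned_cost = returned_bytes / BYTES_PER_GB * return_rate_per_gb
    return scanned_cost + returned_cost
```

For example, with illustrative rates of 0.002 per GB scanned and 0.0007 per GB returned, `estimate_select_cost(10 * BYTES_PER_GB, 0, 0.002, 0.0007)` still comes to about 0.02 even if the query returns almost nothing, because the 10 GB file is scanned in full.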

JSON is fine for data interchange, but is not designed for efficient storage.

You could, however, compress the file to reduce the storage cost. It is possible that this would also reduce the scan cost (as is the case for Amazon Athena), but I could not find any information to confirm this for S3 Select.
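
One way to check the compression question empirically is to read the Stats event that boto3's `select_object_content` emits in its response stream; it reports `BytesScanned`, `BytesProcessed` and `BytesReturned`. The following is a minimal sketch rather than a tested recipe. The helper names are hypothetical, and it assumes the object is a single JSON document as in the question.

```python
def summarize_stats(events):
    """Return the byte counters from the Stats event of a Select event stream."""
    totals = {"BytesScanned": 0, "BytesProcessed": 0, "BytesReturned": 0}
    for event in events:
        # Other event types (Records, Progress, Cont, End) are skipped.
        if "Stats" in event:
            details = event["Stats"]["Details"]
            for name in totals:
                totals[name] = details.get(name, 0)
    return totals


def select_stats(bucket, key, expression, compression="NONE"):
    """Run an S3 Select query and report how many bytes it scanned and returned.

    compression is passed through as CompressionType: "NONE", "GZIP" or "BZIP2".
    """
    import boto3  # imported here so summarize_stats works without boto3 installed

    s3 = boto3.client("s3")
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={
            "JSON": {"Type": "DOCUMENT"},  # assumption: one JSON document per object
            "CompressionType": compression,
        },
        OutputSerialization={"JSON": {}},
    )
    return summarize_stats(response["Payload"])
```

Calling `select_stats` with the query from the question on the plain file, and again with `compression="GZIP"` on a gzipped copy, then comparing each `BytesScanned` against the object size, should show which size the scan charge is based on.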

John Rotenstein