It is possible to read parquet files from S3 as shown here or here.

I am working with S3 access points. Given an S3 access point ARN, is it possible to read parquet files through it?

I am trying with the following sample code:

import s3fs
import pyarrow.parquet as pq

S3_ACCESS_POINT_ARN = "..."

s3_filesystem = s3fs.S3FileSystem()
s3_file_uri = f"{S3_ACCESS_POINT_ARN}/examples/example1.parquet"
example1_df = pq.ParquetDataset(s3_file_uri, s3_filesystem).read_pandas().to_pandas()

Executing it results in:

ParamValidationError: Parameter validation failed:
Invalid bucket name S3_ACCESS_POINT_ARN: Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

I have also tried replacing / with : in S3_ACCESS_POINT_ARN which results in:

PermissionError: AccessDenied

Finally I tried using:

pq.read_table(S3_ACCESS_POINT_ARN, s3_filesystem).to_pandas()

which resulted in:

OSError: Passed non-file path: S3_ACCESS_POINT_ARN

It is worth mentioning that there are no access issues when reading files from this access point; the code below works:

import boto3
import pyarrow.parquet as pq

S3_ACCESS_POINT_ARN = "..."

s3 = boto3.resource('s3')
# boto3 accepts the access point ARN in place of a bucket name
bucket = s3.Bucket(S3_ACCESS_POINT_ARN)
# The first argument is the object key, without the ARN prefix
bucket.download_file("examples/example1.parquet", "/tmp/examples/example1.parquet")
example1_df = pq.read_table("/tmp/examples/example1.parquet").to_pandas()

UPDATE: The S3 access point does not allow ListObjects operations on anything other than the top level:

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

But I cannot see any parameter that would make pyarrow treat the path as a single parquet file, which could avoid this issue.
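One way to sidestep the listing entirely, sketched here as an assumption and not as a pyarrow feature, is to fetch the object with a single GetObject call and hand pyarrow an in-memory buffer, so pyarrow never sees an S3 path at all. This relies on boto3 accepting the access point ARN in place of a bucket name, as the download example above already does. The helper names `fetch_object_bytes` and `read_parquet_object` are hypothetical.

```python
import io


def fetch_object_bytes(s3_client, bucket, key):
    """Download one object with a single GetObject call (no listing involved)."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


def read_parquet_object(s3_client, bucket, key):
    """Read one parquet object into a pandas DataFrame via an in-memory buffer."""
    # Imported lazily so fetch_object_bytes has no pyarrow dependency.
    import pyarrow.parquet as pq

    buffer = io.BytesIO(fetch_object_bytes(s3_client, bucket, key))
    # pq.read_table accepts a file-like object, so no S3 path is parsed here.
    return pq.read_table(buffer).to_pandas()
```

It could then be called as `read_parquet_object(boto3.client("s3"), S3_ACCESS_POINT_ARN, "examples/example1.parquet")`. The whole file is held in memory, so this suits small and medium files better than large datasets.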

Krzysztof Słowiński
  • I have the same issue and the error message says: `Invalid bucket name "arn:aws:s3:us-east-1:291160143014:accesspoint"` so the actual access point name is stripped from the ARN. I suspect pyarrow only expects a bucket path and does not support access points yet. – taras Oct 29 '20 at 10:38
  • https://issues.apache.org/jira/browse/ARROW-9669 – taras Oct 29 '20 at 20:45

1 Answer

You have to use the S3 access point alias, not the S3 access point ARN.
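A rough sketch of what that could look like with the s3fs and pyarrow setup from the question: an access point alias can be used wherever a bucket name is expected, so it goes where the bucket would normally go in the path. The alias value below is a made-up placeholder, and `alias_path` and `read_parquet_via_alias` are hypothetical helper names. Opening the object as a file handle before passing it to `pq.read_table` is an assumption aimed at keeping pyarrow from treating the path as a dataset directory; it has not been verified to avoid every list call.

```python
def alias_path(access_point_alias: str, key: str) -> str:
    """Join an access point alias and an object key into a bucket-style path."""
    return f"{access_point_alias}/{key.lstrip('/')}"


def read_parquet_via_alias(access_point_alias: str, key: str):
    """Read a single parquet object through an access point alias."""
    # Imported here so alias_path stays usable without these packages.
    import pyarrow.parquet as pq
    import s3fs

    s3_filesystem = s3fs.S3FileSystem()
    # Open the object itself and hand pyarrow a file handle, not a path.
    with s3_filesystem.open(alias_path(access_point_alias, key), "rb") as f:
        return pq.read_table(f).to_pandas()


# Placeholder alias for illustration only; use the alias of your access point.
S3_ACCESS_POINT_ALIAS = "example-ap-alias-s3alias"
print(alias_path(S3_ACCESS_POINT_ALIAS, "examples/example1.parquet"))
```

The alias is shown next to the ARN in the access point details in the S3 console.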

Taha Khan