I have an S3 bucket and a SageMaker notebook. I'm trying to access rather large BDF and EDF files (1-2 GB) stored in the S3 bucket without copying them to the SageMaker volume.
I also need to access these files by URL, since the EDF processing function mne.io.read_raw_edf takes the absolute path to a file as input.
The S3 bucket is in the same region as the SageMaker notebook instance, and the IAM role associated with the notebook instance has permission to access the bucket.
First, I tried the approach from this question, which describes how to read .csv files. Although it worked well with .csv files in my case, it failed with .edf ones.
import mne
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()
bucket = 'my-bucket'
key = 'train.edf'

# Build an S3 URI and pass it to MNE directly
data_location = 's3://{}/{}'.format(bucket, key)
mne.io.read_raw_edf(data_location)
When I execute this code, I receive the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/SageMaker/s3:/my-bucket/train.edf'
Here the notebook's working directory is prepended to the S3 URI, a path stacking that was not done by me. I don't quite understand why pd.read_csv reads such paths normally, unlike mne.io.read_raw_edf, which seems to stack the local path and the remote one.
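For comparison, this kind of call works for me on the same notebook instance (as far as I understand, pandas hands s3:// URIs off to s3fs under the hood, while MNE treats the argument as a plain local path; train.csv is just an example key):
import pandas as pd

# pandas resolves the s3:// URI via s3fs instead of the local filesystem
df = pd.read_csv('s3://my-bucket/train.csv')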
Then I found an answer to a very similar question, but I ran into the same problem with the stacked paths.
import boto3

# Build a region-specific URL for the object from the bucket's location
bucket_location = boto3.client('s3').get_bucket_location(Bucket=bucket)
object_url = "https://s3-{0}.amazonaws.com/{1}/{2}".format(
    bucket_location['LocationConstraint'],
    bucket,
    key)
object_url
'https://s3-us-west-2.amazonaws.com/my-bucket/train.edf'
Here the URL itself looks fine.
mne.io.read_raw_edf(object_url)
When I execute this code, I receive the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/SageMaker/https:/s3-us-west-2.amazonaws.com/my-bucket/train.edf'
mne.io.read_raw_edf performs the same weird stacking again.
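My guess (an assumption on my part; I haven't checked the MNE source) is that the reader normalizes its input as a local filesystem path, with something like os.path.abspath, which reproduces the path from the error exactly:
import os

# A string without a leading '/' is treated as relative: it gets joined with
# the working directory, and normpath collapses the '//' after 'https:'
os.path.abspath('https://s3-us-west-2.amazonaws.com/my-bucket/train.edf')
# -> '/home/ec2-user/SageMaker/https:/s3-us-west-2.amazonaws.com/my-bucket/train.edf'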
Finally, I tried to follow the approach described in this article.
s3 = boto3.client("s3",
                  region_name='us-west-2',
                  aws_access_key_id='access_key_id',
                  aws_secret_access_key='secret_access_key')

# Generate a presigned URL that is valid for one hour
share_url = s3.generate_presigned_url(ClientMethod="get_object",
                                      ExpiresIn=3600,
                                      Params={"Bucket": bucket, "Key": key})
share_url
share_url
'https://my-bucket.s3.amazonaws.com/train.edf?AWSAccessKeyId=access_key_id&Signature=signature&Expires=1616777253'
The URL looks normal again.
mne.io.read_raw_edf(share_url)
NotImplementedError: Only EDF files are supported by read_raw_edf, got edf?awsaccesskeyid=access_key_id&signature=signature&expires=1616777253
But here I got another piece of weird mne.io.read_raw_edf behavior: no more stacking, but the path got cropped.
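My guess (again an assumption based on the error text, not verified against the MNE source) is that the reader derives the file type from the extension with something like os.path.splitext, so the query string of the presigned URL ends up inside the "extension":
import os

# The last dot of the URL sits in 'train.edf?...', so the query string
# becomes part of the extension that the reader checks
url = 'https://my-bucket.s3.amazonaws.com/train.edf?AWSAccessKeyId=access_key_id&Signature=signature&Expires=1616777253'
os.path.splitext(url)[1]
# -> '.edf?AWSAccessKeyId=access_key_id&Signature=signature&Expires=1616777253'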
I assume this could be a problem with mne.io.read_raw_edf itself, but I have never faced anything like this outside Amazon products.
Does it make sense to access the BDF and EDF files by URL, or is it better to copy the files to the SageMaker volume? I apologize if this question seems naive. I've already spent a couple of days on this problem, and I need to sort it out as soon as possible, since Amazon charges for every hour the notebook is active.
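For reference, the fallback I have in mind if URL access isn't workable is to copy each object to local storage first and hand MNE a real path (a minimal sketch; /tmp/train.edf is just an example location):
import boto3
import mne

s3 = boto3.client('s3')
local_path = '/tmp/train.edf'

# Copy the object from S3 to the instance's local storage, then read it
s3.download_file(bucket, key, local_path)
raw = mne.io.read_raw_edf(local_path)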