I have an S3 bucket and a SageMaker notebook. I'm trying to access rather large BDF and EDF files (1-2 GB) stored in the S3 bucket without copying them to the SageMaker volume.
I also need to access these files by URL, since the EDF processing function mne.io.read_raw_edf takes the absolute path to a file as input.
The S3 bucket is in the same region as the SageMaker notebook instance, and the IAM role associated with the notebook instance has permission to access the bucket.
First, I tried the approach from this question, which describes how to read .csv files. Although it worked well with .csv files in my case, it failed with .edf ones.
import mne
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()
bucket = 'my-bucket'
key = 'train.edf'

# Build an S3 URI and pass it to MNE directly
data_location = 's3://{}/{}'.format(bucket, key)
mne.io.read_raw_edf(data_location)
When I execute this code, I receive the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/SageMaker/s3:/my-bucket/train.edf'
Here the notebook's working directory is prepended to the S3 URI, a path stacking that was not done by me. I don't quite understand why pd.read_csv reads such paths normally, unlike mne.io.read_raw_edf, which seems to stack the local path and the remote one.
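For comparison, this kind of call works for me on the same notebook instance (as far as I understand, pandas hands s3:// URIs off to s3fs under the hood, while MNE treats the argument as a plain local path; train.csv is just an example key):
import pandas as pd

# pandas resolves the s3:// URI via s3fs instead of the local filesystem
df = pd.read_csv('s3://my-bucket/train.csv')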
Then I found an answer to a very similar question, but I ran into the same problem with the stacked paths.
import boto3

# Build a region-specific URL for the object from the bucket's location
bucket_location = boto3.client('s3').get_bucket_location(Bucket=bucket)
object_url = "https://s3-{0}.amazonaws.com/{1}/{2}".format(
    bucket_location['LocationConstraint'],
    bucket,
    key)
object_url
'https://s3-us-west-2.amazonaws.com/my-bucket/train.edf'
Here the URL itself looks fine.
mne.io.read_raw_edf(object_url)
When I execute this code, I receive the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/SageMaker/https:/s3-us-west-2.amazonaws.com/my-bucket/train.edf'
mne.io.read_raw_edf performs the same weird stacking again.
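My guess (an assumption on my part; I haven't checked the MNE source) is that the reader normalizes its input as a local filesystem path, with something like os.path.abspath, which reproduces the path from the error exactly:
import os

# A string without a leading '/' is treated as relative: it gets joined with
# the working directory, and normpath collapses the '//' after 'https:'
os.path.abspath('https://s3-us-west-2.amazonaws.com/my-bucket/train.edf')
# -> '/home/ec2-user/SageMaker/https:/s3-us-west-2.amazonaws.com/my-bucket/train.edf'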
Finally, I tried to follow the approach described in this article.
s3 = boto3.client("s3",
                  region_name='us-west-2',
                  aws_access_key_id='access_key_id',
                  aws_secret_access_key='secret_access_key')

# Generate a presigned URL that is valid for one hour
share_url = s3.generate_presigned_url(ClientMethod="get_object",
                                      ExpiresIn=3600,
                                      Params={"Bucket": bucket, "Key": key})
share_url
share_url
'https://my-bucket.s3.amazonaws.com/train.edf?AWSAccessKeyId=access_key_id&Signature=signature&Expires=1616777253'
The URL looks normal again.
mne.io.read_raw_edf(share_url)
NotImplementedError: Only EDF files are supported by read_raw_edf, got edf?awsaccesskeyid=access_key_id&signature=signature&expires=1616777253
But here I got another piece of weird mne.io.read_raw_edf behavior: no more stacking, but the path got cropped.
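My guess (again an assumption based on the error text, not verified against the MNE source) is that the reader derives the file type from the extension with something like os.path.splitext, so the query string of the presigned URL ends up inside the "extension":
import os

# The last dot of the URL sits in 'train.edf?...', so the query string
# becomes part of the extension that the reader checks
url = 'https://my-bucket.s3.amazonaws.com/train.edf?AWSAccessKeyId=access_key_id&Signature=signature&Expires=1616777253'
os.path.splitext(url)[1]
# -> '.edf?AWSAccessKeyId=access_key_id&Signature=signature&Expires=1616777253'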
I assume this could be a problem with mne.io.read_raw_edf itself, but I have never faced anything like this outside Amazon products.
Does it make sense to access the BDF and EDF files by URL, or is it better to copy the files to the SageMaker volume? I apologize if this question seems naive. I've already spent a couple of days on this problem, and I need to sort it out as soon as possible, since Amazon charges for every hour the notebook is active.
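For reference, the fallback I have in mind if URL access isn't workable is to copy each object to local storage first and hand MNE a real path (a minimal sketch; /tmp/train.edf is just an example location):
import boto3
import mne

s3 = boto3.client('s3')
local_path = '/tmp/train.edf'

# Copy the object from S3 to the instance's local storage, then read it
s3.download_file(bucket, key, local_path)
raw = mne.io.read_raw_edf(local_path)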