We have an EMR Studio whose default S3 location is set to s3://OurBucketName/Subdirectory/work, and within it we've created a Workspace attached to an EC2 cluster running emr-6.10.0 with the following applications installed:

  • Hadoop 3.3.3
  • Hive 3.1.3
  • Hue 4.10.0
  • JupyterEnterpriseGateway 2.6.0
  • JupyterHub 1.5.0
  • MXNet 1.9.1
  • Pig 0.17.0
  • Presto 0.278
  • Spark 3.3.1
  • TensorFlow 2.11.0
  • Zeppelin 0.10.1

We can view, read, and write files from within the (bash) Terminal in our Workspace, which appears to contain a copy of everything under the s3://OurBucketName/Subdirectory/work S3 prefix at /home/notebook/work. However, we cannot read or write files from within any of the consoles or notebooks.

We have tried a number of file paths, including:

  • relative: ~/data/filename.csv
  • absolute: /home/notebook/work/ProjectName/data/filename.csv
  • S3: s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv
  • EMR Shareable Link: https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv
  • EMR Download Link: https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>

The target file definitely exists: it can be seen in the file browser on the left-hand side, and it can be opened, read, and modified from within the Terminal or by any scripts executed from it.

Running the following

offices <- read.csv("~/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")

from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns

[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '~/data/filename.csv': No such file or directory,

and running the following

offices <- read.csv("/home/notebook/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")

from the same notebook returns

[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '/home/notebook/work/ProjectName/data/filename.csv': No such file or directory,

and running the following

offices <- read.csv("s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = "."),

from the same notebook returns

[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file 's3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv': No such file or directory;

whereas running either

offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")

or

offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>", header = TRUE, sep = ",", quote = "\"",dec = ".")

runs seemingly without error; however, each appears to read an HTML page rather than the CSV, because running

summary(offices)

from the same notebook returns

X..DOCTYPE.html. Length:28 Class :character Mode :character
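A quick way to see the raw response directly (a minimal sketch, reusing the <NotebookID> placeholder from above) is to peek at the first few lines served by the link, which we would expect to be the Jupyter page's HTML rather than CSV rows:

# Inspect the first few lines returned by the link; if they are HTML tags
# (e.g. <!DOCTYPE html>) rather than CSV rows, read.csv parsed the notebook
# UI page instead of the file.
readLines("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv", n = 5)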

Lastly, it appears that the associated (Python, PySpark, Spark, or SparkR) kernels are running inside a YARN container on one of the cluster's /mnt drives, because running

getwd()

from the same notebook returns

/mnt1/yarn/usercache/livy/appcache/application_1678485106748_0005/container_1678485106748_0005_01_000001

however, running

setwd("/home/notebook")

from the same notebook returns

[1] "Error in setwd(\"/home/notebook\"): cannot change working directory".

1 Answer

We don't use EMR Studio; we use SageMaker Studio instead, following this setup: https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster.html

But I have seen your problem as well. In my case, I was trying to read data from the S3 path s3://bucket/path/to/file, and it kept telling me the file did not exist even though I was dead sure it did (no typo, etc.). I swapped s3 for s3a and got a more informative error: the EMR cluster's EC2 role did not, in fact, have the necessary permissions.

So I think the easiest way to verify whether the same thing is happening in your case would be to SSH onto the leader node (e.g. using Session Manager) and try to read s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv from there. If you are sure that S3 path exists, then I bet your case is the same as mine. You could also first try what I did: use the "s3a://..." path, which goes through a different (older) connector but should hopefully give you a more informative exception; a sketch of that swap follows.
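For concreteness, here is a sketch of what that swap looks like from a SparkR notebook, assuming the read goes through Spark's distributed reader rather than base R:

# Same S3 object, but read via the Hadoop S3A connector; in my case this
# surfaced the underlying permissions error (e.g. AccessDenied) instead of a
# generic "does not exist".
df <- read.df("s3a://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv",
              source = "csv", header = "true")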

  • Thanks, @jakub-kaplan. We tried various things based on your comment: we tried the s3a path but didn't receive more informative errors; we confirmed we could see the S3 path by SSHing into the leader node; and we went through all the EC2 role specifics and added any permissions we could think of. The strange thing is that we can see the S3 bucket's contents from the (bash) Terminal app inside the Workspace at /home/notebook, so it appears not to be the Workspace permissions after all? It's almost as if Spark itself can't see S3 buckets. (On that note, we installed Livy 0.7.1, but that didn't correct the issue either.) – dragonscience Mar 30 '23 at 18:20
  • @dragonscience I see, thanks for the follow-up. Sorry, I don't really have any other pointers for this; it's super weird that you can see it from the leader node command line but not from Spark - that should be using the same IAM role (the EMR cluster's EC2 role). One last desperate attempt you could make is to look at the relevant events in CloudTrail. Sometimes, when I get a completely useless message from EMR, I can see more information by looking at the S3 API calls in CloudTrail - not always, but if you're desperate, that's my last idea, unfortunately. – Jakub Kaplan Mar 30 '23 at 22:18