We have an EMR Studio whose default S3 location is set to s3://OurBucketName/Subdirectory/work, and within it we've created a Workspace attached to an EC2 cluster running emr-6.10.0 with the following applications installed:
- Hadoop 3.3.3
- Hive 3.1.3
- Hue 4.10.0
- JupyterEnterpriseGateway 2.6.0
- JupyterHub 1.5.0
- MXNet 1.9.1
- Pig 0.17.0
- Presto 0.278
- Spark 3.3.1
- TensorFlow 2.11.0
- Zeppelin 0.10.1
We can view, read, and write files from within the (bash) Terminal in our Workspace, which appears to contain a copy of everything inside the s3://OurBucketName/Subdirectory/work S3 bucket at the /home/notebook/work location. However, we cannot read or write files from within any of the consoles or notebooks.
We have tried a number of file paths, including relative (~/data/filename.csv), absolute (/home/notebook/work/ProjectName/data/filename.csv), S3 (s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv), EMR Shareable Link (https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv), and EMR Download Link (https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>).
The target file definitely exists: it can be seen in the file browser on the left-hand side, and can be opened, read, and modified from within the Terminal or by any scripts executed from it.
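To make the mismatch concrete, a small diagnostic can be run from a notebook cell (shown here in Python, since Python/PySpark kernels are also available on the cluster); the candidate paths below mirror our layout and are assumptions about yours:

```python
import os

# Hedged diagnostic sketch: report which filesystem the notebook kernel
# actually sees. The candidate paths are from our Workspace layout and
# are assumptions; substitute your own.
def visible(path: str) -> bool:
    """True if the kernel's local filesystem can see `path`."""
    return os.path.exists(path)

candidates = [
    os.path.expanduser("~/data/filename.csv"),
    "/home/notebook/work/ProjectName/data/filename.csv",
]
print("kernel cwd:", os.getcwd())
for p in candidates:
    print(p, "->", "visible" if visible(p) else "NOT visible")
```

If the kernel were running in the Workspace container, both paths should report as visible; run from the Terminal, an equivalent check succeeds.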
Running
offices <- read.csv("~/data/filename.csv", header = TRUE, sep = ",", quote = "\"", dec = ".")
from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '~/data/filename.csv': No such file or directory
and running
offices <- read.csv("/home/notebook/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"", dec = ".")
from the same notebook returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '/home/notebook/work/ProjectName/data/filename.csv': No such file or directory
and running
offices <- read.csv("s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"", dec = ".")
returns
[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file 's3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv': No such file or directory
In contrast, running either
offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"", dec = ".")
or
offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>", header = TRUE, sep = ",", quote = "\"", dec = ".")
runs seemingly without error, but appears to read a blank HTML page, because running
summary(offices)
returns
X..DOCTYPE.html. Length:28 Class :character Mode :character
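That X..DOCTYPE.html. column name suggests the HTTPS links returned an HTML document (presumably a login or redirect page) that read.csv dutifully parsed as a single 28-row character column. A hedged sketch of a sanity check for this failure mode, in Python (the sample strings are illustrative):

```python
# Hedged sketch: detect when a fetched "CSV" is actually an HTML page,
# e.g. an auth or redirect page served in place of the requested file.
def looks_like_html(first_line: str) -> bool:
    """True if the first line of the payload starts an HTML document."""
    return first_line.lstrip().lower().startswith(("<!doctype html", "<html"))

print(looks_like_html("<!DOCTYPE html>"))        # HTML page, not data
print(looks_like_html("office,city,headcount"))  # plausibly a real CSV header
```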
Lastly, it appears that the associated (Python, PySpark, Spark, or SparkR) kernels are running in a container somewhere on one of the mnt drives, because running
getwd()
from the same SparkR notebook returns
/mnt1/yarn/usercache/livy/appcache/application_1678485106748_0005/container_1678485106748_0005_01_000001
however, running
setwd("/home/notebook")
returns
[1] "Error in setwd(\"/home/notebook\"): cannot change working directory"
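The setwd() failure is consistent with /home/notebook simply not existing on the host where the kernel's YARN container runs. The same behavior can be reproduced in any kernel; a Python equivalent, using a deliberately nonexistent path:

```python
import os

# Hedged illustration: changing into a directory that does not exist on the
# kernel's host fails, just as setwd("/home/notebook") does when that path
# only exists in the Workspace container, not the YARN container.
def try_chdir(path: str) -> str:
    try:
        os.chdir(path)
        return "now in " + os.getcwd()
    except OSError:
        return "cannot change working directory to " + path

print(try_chdir("/this/path/does/not/exist"))
```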