
I am running a job on a Google Cloud Dataproc cluster that takes one parameter: the path to the input file. This file is stored in a Google Cloud Storage bucket. I get a FileNotFoundException (trace below). Why would that be?

gcloud dataproc jobs submit spark --cluster cluster-1 --class MST.ComputeMST \
    --jars gs://dataproc-211700eb-83ed-456d-a67e-98af9e6fa02d-us/ComputeMST.jar \
    -- gs:///dataproc-211700eb-83ed-456d-a67e-98af9e6fa02d-us/input.txt

Job [8b193fcd-1350-462b-ae11-373333e868fe] submitted.
Waiting for job output...
17/05/16 05:06:02 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
number of runs = 0
Exception in thread "main" java.io.FileNotFoundException: gs:/dataproc-211700eb-83ed-456d-a67e-98af9e6fa02d-us/input.txt (No such file or directory)
  at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at java.io.FileInputStream.<init>(FileInputStream.java:93)
  at java.io.FileReader.<init>(FileReader.java:58)
  at MST.ComputeMST.main(ComputeMST.java:670)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [8b193fcd-1350-462b-ae11-373333e868fe] entered state [ERROR] while waiting for [DONE].

1 Answer


Even though the GCS connector is installed by default on Cloud Dataproc clusters, you cannot use it from your job through the java.io.FileReader interface, which only reads local files.

To access GCS objects through the GCS connector, you need to use Hadoop's FileSystem interface, as in the sketch below.
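A minimal sketch of reading a gs:// object via Hadoop's FileSystem API; the class name ReadGcsExample and the assumption that the path arrives as args[0] are illustrative, not from your code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadGcsExample {
  public static void main(String[] args) throws Exception {
    // Resolve the FileSystem from the URI scheme; on Dataproc, gs:// maps to the GCS connector.
    Path inputPath = new Path(args[0]);
    FileSystem fs = inputPath.getFileSystem(new Configuration());

    // Open the object as a stream instead of using java.io.FileReader.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(inputPath), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

Opening the stream through FileSystem lets the GCS connector resolve the gs:// scheme, whereas java.io.FileReader treats the argument as a local path, which is why your stack trace shows FileInputStream failing on it.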
