
I'm running into an issue when building the spark-tensorflow-connector on GCP's Dataproc.

The problem occurs when one of the tests fails with:

java.lang.IllegalStateException: LocalPath /tmp/spark-connector-propagate7442350445858279141 already exists. SaveMode: ErrorIfExists

I believe the issue is related to this part of the LocalWriteSuite.scala script:

"Propagate" should {
   "write data locally" in {
     // Create a dataframe with 2 partitions
     val rdd = spark.sparkContext.parallelize(testRows, numSlices = 2)
     val df = spark.createDataFrame(rdd, schema)

     // Write the partitions onto the local hard drive. Since it is going to be the
     // local file system, the partitions will be written in the same directory of the
     // same machine.
     // In a distributed setting though, two different machines would each hold a single
     // partition.
     val localPath = Files.createTempDirectory("spark-connector-propagate").toAbsolutePath.toString
     // Delete the directory, the default mode is ErrorIfExists
     Files.delete(Paths.get(localPath))
     df.write.format("tfrecords")
       .option("recordType", "Example")
       .option("writeLocality", "local")
       .save(localPath)

     // Read again this directory, this time using the Hadoop file readers, it should
     // return the same data.
     // This only works in this test and does not hold in general, because the partitions
     // will be written on the workers. Everything runs locally for tests.
     val df2 = spark.read.format("tfrecords").option("recordType", "Example")
       .load(localPath).sort("id").select("id", "IntegerTypeLabel", "LongTypeLabel",
       "FloatTypeLabel", "DoubleTypeLabel", "VectorLabel", "name") // Correct column order.

     assert(df2.collect().toSeq === testRows.toSeq)
   }
 }
}

If I understand correctly, the dataset has two partitions, and it seems that both are attempting to write locally to the same path.
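For what it's worth, the exception message itself is just the connector's default SaveMode.ErrorIfExists behavior when the target directory already exists. Here is a minimal sketch (assuming a local SparkSession named spark and the same testRows/schema as in the suite) that reproduces the same message simply by saving to a path that was created but not deleted first:

import java.nio.file.Files

// Same dataframe shape as the test: two partitions of the same data.
val rdd = spark.sparkContext.parallelize(testRows, numSlices = 2)
val df  = spark.createDataFrame(rdd, schema)

// Save into a directory that still exists. The connector's default
// SaveMode is ErrorIfExists, so this throws:
// "LocalPath ... already exists. SaveMode: ErrorIfExists"
val existing = Files.createTempDirectory("spark-connector-propagate").toAbsolutePath.toString
df.write.format("tfrecords")
  .option("recordType", "Example")
  .option("writeLocality", "local")
  .save(existing)

So if the test deletes the directory and the error still appears, something between the delete and the save is recreating the path, which is consistent with the two-partitions hypothesis above.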

Has anybody run into this issue, or am I missing a step?

Please note that I posted a similar question on GitHub.


1 Answer


I had a feeling that I had missed a step, given that this is a widely used package and many people have successfully built the spark-tensorflow-connector:

I did not build tensorflow-hadoop as a Maven dependency, a step that is clearly defined in step 3 of the build instructions.

However, when building tensorflow-hadoop, I had to set an additional option, as suggested by Michael in Maven surefire could not find ForkedBooter class:

export _JAVA_OPTIONS=-Djdk.net.URLClassPath.disableClassPathURLCheck=true

EDIT: The issue still persists on Dataproc

Solution:

After some research, I directly downloaded the latest version of spark-tensorflow-connector and installed it following the directions posted on Maven. I did not have to build tensorflow-hadoop as suggested in the TensorFlow Ecosystem repository. Note that I was able to install the jar file on my Dataproc cluster.
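For completeness, this is the sanity check I would run once the jar is on the cluster classpath (for example, passed with --jars to spark-shell). It is only a sketch: the output path and column names are placeholders, and it assumes an active SparkSession named spark:

// Assumes the spark-tensorflow-connector jar is already on the classpath.
import spark.implicits._

val df = Seq((1L, "alpha"), (2L, "beta")).toDF("id", "name")

// Round-trip a small dataframe through the tfrecords format.
df.write.format("tfrecords")
  .option("recordType", "Example")
  .save("/tmp/tfrecords-sanity-check")

// The connector infers the schema from the records on read.
val back = spark.read.format("tfrecords")
  .option("recordType", "Example")
  .load("/tmp/tfrecords-sanity-check")

back.orderBy("id").show()

If this round-trip succeeds on the cluster, the connector jar is installed correctly.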
