
I've written unit tests referring to the DataframeGenerator example from spark-testing-base, which allows you to generate mock dataframes on the fly.

After having executed the following commands successfully

sbt clean
sbt update
sbt compile

I get the errors shown in the output below upon running either of the following commands:

sbt assembly
sbt test -- -oF

Output

...
[info] SearchClicksProcessorTest:
17/11/24 14:19:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/24 14:19:07 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
17/11/24 14:19:18 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/11/24 14:19:18 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/11/24 14:19:19 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
[info] - testExplodeMap *** FAILED ***
[info]   ExceptionInInitializerError was thrown during property evaluation.
[info]     Message: "None"
[info]     Occurred when passed generated values (
[info]   
[info]     )
[info] - testFilterByClicks *** FAILED ***
[info]   NoClassDefFoundError was thrown during property evaluation.
[info]     Message: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[info]     Occurred when passed generated values (
[info]   
[info]     )
[info] - testGetClicksData *** FAILED ***
[info]   NoClassDefFoundError was thrown during property evaluation.
[info]     Message: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[info]     Occurred when passed generated values (
[info]   
[info]     )
...
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 6, Failed 3, Errors 0, Passed 3
[error] Failed tests:
[error]         com.company.spark.ml.pipelines.search.SearchClicksProcessorTest
[error] (root/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 73 s, completed 24 Nov, 2017 2:19:28 PM

Things that I've tried unsuccessfully

  • Running sbt test with the -oF flag to show the full stack trace (no stack trace appears in the output, as shown above); see the build.sbt sketch below this list for how I understand the flag is meant to be passed
  • Re-building the project in IntelliJ IDEA
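
For reference, my understanding is that the ScalaTest flag can also be wired into build.sbt rather than passed on the command line (sbt 0.13 syntax; this only controls the reporter output and is not a fix for the failures):

// build.sbt -- ask ScalaTest for full stack traces on every test run
testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oF")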

My questions are

  • What could be the possible cause of this error?
  • How can I enable the stack-trace output in SBT to be able to debug it?

EDIT-1: My unit-test class contains several methods like the one below

import com.holdenkarau.spark.testing.DataframeGenerator
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.scalacheck.Prop
import org.scalatest.FunSuite
import org.scalatest.prop.Checkers

class SearchClicksProcessorTest extends FunSuite with Checkers {
  import spark.implicits._

  test("testGetClicksData") {
    val schemaIn = StructType(List(
      StructField("rank", IntegerType),
      StructField("city_id", IntegerType),
      StructField("target", IntegerType)
    ))
    val schemaOut = StructType(List(
      StructField("clicked_res_rank", IntegerType),
      StructField("city_id", IntegerType)
    ))
    val dataFrameGen = DataframeGenerator.arbitraryDataFrame(spark.sqlContext, schemaIn)

    val property = Prop.forAll(dataFrameGen.arbitrary) { dfIn: DataFrame =>
      dfIn.cache()
      val dfOut: DataFrame = dfIn.transform(SearchClicksProcessor.getClicksData)

      dfIn.schema === schemaIn &&
        dfOut.schema === schemaOut &&
        dfIn.filter($"target" === 1).count() === dfOut.count()
    }
    check(property)
  }
}

while build.sbt looks like this

// core settings
organization := "com.company"
scalaVersion := "2.11.11"

name := "repo-name"
version := "0.0.1"

// cache options
offline := false
updateOptions := updateOptions.value.withCachedResolution(true)

// aggregate options
aggregate in assembly := false
aggregate in update := false

// fork options
fork in Test := true

//common libraryDependencies
libraryDependencies ++= Seq(
  scalaTest,
  typesafeConfig,
  ...
  scalajHttp
)

libraryDependencies ++= allAwsDependencies
libraryDependencies ++= SparkDependencies.allSparkDependencies

assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  ...
  case _ => MergeStrategy.first
}

lazy val module1 = project in file("directory-1")

lazy val module2 = (project in file("directory-2")).
  dependsOn(module1).
  aggregate(module1)

lazy val root = (project in file(".")).
  dependsOn(module2).
  aggregate(module2)
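
For reference, spark-testing-base's documentation recommends test settings roughly along these lines (quoted from memory, so the exact flags and values may differ); the excerpt above only shows the fork setting:

// settings commonly recommended for running Spark test suites with spark-testing-base
fork in Test := true
parallelExecution in Test := false
javaOptions in Test ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")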
  • Have a look at [this issue](https://github.com/holdenk/spark-testing-base/issues/216) and consider explaining the queries asked there – y2k-shubham Nov 24 '17 at 11:26
  • Whats your build file & source code for the test look like? – Holden Nov 25 '17 at 10:17
  • My guess is that the tests are executed in parallel and each tries to create a brand new `SparkSession` so I'd disable parallel test execution --> https://stackoverflow.com/q/11899723/1305344 – Jacek Laskowski Nov 25 '17 at 14:11
  • Looks like this error has nothing to do with @Holden's DataFrameGenerator (running the tests without it results in the same error). I've narrowed the problem down to DataFrame creation via spark.createDataFrame(rdd: RDD, schema: StructType); in particular, creating the RDD from a sample Seq(Row) requires spark.parallelize, which I believe is what triggers the error. I still haven't been able to overcome it, so any insight would be helpful. – y2k-shubham Nov 29 '17 at 07:12
  • I've also tried @Jacek's suggestion to disable parallelism in tests, without luck – y2k-shubham Nov 29 '17 at 07:18

2 Answers


P.S. Please read the comments on the original question before reading this answer.


  • Even the popular solution of overriding SBT's transitive dependency on com.fasterxml.jackson didn't work for me as-is; some more changes were required (the ExceptionInInitializerError was gone, but another error cropped up). See the dependencyOverrides snippet at the end of this answer for what that override typically looks like.

  • Finally (in addition to the above-mentioned fix) I ended up creating DataFrames in a different way (as opposed to the StructType-based approach used here). I created them as

    spark.sparkContext.parallelize(Seq(MyType(...))).toDF()

    where MyType is a case class matching the schema of the DataFrame (see the sketch right after this list)

  • While implementing this solution, I ran into a small problem: although the datatypes of the schema generated from the case class were correct, the nullability of fields often didn't match; the fix for this issue was found here
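
To make the second bullet concrete, the creation looked roughly like this (ClickRow is a made-up case class standing in for the actual schema, and it needs to be defined at the top level rather than inside the test method so that the encoder can be derived):

// hypothetical case class mirroring the DataFrame's columns
case class ClickRow(clicked_res_rank: Int, city_id: Int, target: Int)

// inside the test, with the SparkSession (spark) in scope
import spark.implicits._

val df = spark.sparkContext
  .parallelize(Seq(ClickRow(1, 42, 1), ClickRow(2, 42, 0)))
  .toDF()

df.printSchema()  // types come from the case class; note that primitive (Int) fields come out as non-nullable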


Here I'm openly admitting that I'm not sure which was the actual fix: the fasterxml jackson dependency override or the alternate way of creating DataFrames, so please feel free to fill in the gaps in understanding / investigating this issue.
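
For completeness, the jackson override mentioned in the first bullet is usually written like this (sbt 0.13 syntax; 2.6.5 is the version commonly paired with Spark 2.x, so adjust it to whatever your Spark version pulls in):

dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core"    % "jackson-core"          % "2.6.5",
  "com.fasterxml.jackson.core"    % "jackson-databind"      % "2.6.5",
  "com.fasterxml.jackson.module" %% "jackson-module-scala"  % "2.6.5"
)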


I have had a similar problem, and after investigating I found out that adding lazy before a val solved my issue. My guess is that running Scala code with ScalaTest uses a slightly different initialization sequence: whereas a normal Scala execution initializes vals top-down in source-code order (with nested object { ... } blocks initialized the same way), the same code under ScalaTest initializes the vals inside nested object { ... } blocks before the vals that appear above the object { ... } in the source.

This is admittedly vague, I know, but deferring initialization by prefixing the vals with lazy could solve the test issue here.

The crucial thing is that this doesn't occur in normal execution, only in test execution, and in my case it only occurred when using lambdas with tap in this form:

...
.tap(x =>
  hook_feld_erweiterungen_hook(
    abc = theProblematicVal
  )
)
...
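
Schematically, the change was nothing more than this (buildExpensiveValue stands in for whatever actually initializes the field):

// before: under ScalaTest the nested object could be initialized first,
// so this val was still uninitialized (null) when the tap lambda ran
// val theProblematicVal = buildExpensiveValue()

// after: lazy defers initialization to first use, independent of initialization order
lazy val theProblematicVal = buildExpensiveValue()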