
I have an Apache Spark data loading and transformation application built with pyspark.sql that runs for half an hour before throwing an AttributeError or other runtime exceptions.

I want to test my application end-to-end with a small data sample, something like Apache Pig's ILLUSTRATE. Sampling down the data does not help much. Is there a simple way to do this?

asked by Sam (edited by Jacek Laskowski)

2 Answers


This sounds like something that could easily be handled by a SparkListener. A listener gives you access to all the low-level details that the web UI of any Spark application can show you. All the events flowing between the driver (namely DAGScheduler and TaskScheduler with SchedulerBackend) and the executors are also posted to registered SparkListeners.


A Spark listener is an implementation of the SparkListener developer API (an extension of SparkListenerInterface in which all the callback methods are no-ops).

Spark itself uses listeners for the web UI, event persistence (for the Spark History Server), dynamic allocation of executors, and other services.

You can develop your own custom Spark listeners and register them using the SparkContext.addSparkListener method or the spark.extraListeners setting.
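
As a rough sketch (SparkListener is a JVM-side developer API, so this is Scala; the class name StageLoggingListener and what it logs are illustrative choices of mine, not a fixed recipe), a listener that reports where a long-running job dies could look like this:

    import org.apache.spark.Success
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Logs every stage completion (with its failure reason, if any) and every
    // failed task, so you can see where a long job dies without the web UI.
    class StageLoggingListener extends SparkListener {

      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        println(s"Stage ${info.stageId} (${info.name}): " +
          info.failureReason.getOrElse("succeeded"))
      }

      override def onTaskEnd(event: SparkListenerTaskEnd): Unit =
        event.reason match {
          case Success => // successful tasks are not interesting here
          case reason  => println(s"Task failed in stage ${event.stageId}: $reason")
        }
    }

    // Register it programmatically...
    spark.sparkContext.addSparkListener(new StageLoggingListener)

    // ...or via configuration (the class then needs a no-arg constructor
    // and must be on the driver's classpath):
    //   spark-submit --conf spark.extraListeners=my.package.StageLoggingListener ...

Either way, the listener is invoked on the driver, so the output shows up in the driver log.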

Jacek Laskowski
  • Go to the Spark UI of your job and you will find a DAG Visualization there. That's a graph representing your job.
  • To test your job on a sample, call sample on your input first of all ;) Also, you may run Spark locally, not on a cluster, and then debug it in the IDE of your choice (like IntelliJ IDEA); see the sketch below.
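
For instance, a local smoke test of the whole pipeline might look like this (Scala, for consistency with the listener sketch above; the input path, the fraction, and runPipeline are placeholders — pyspark offers the same DataFrame.sample and master("local[*]") options):

    import org.apache.spark.sql.SparkSession

    object SmokeTest extends App {
      // Local mode: the whole pipeline runs in-process, no cluster required.
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("pipeline-smoke-test")
        .getOrCreate()

      // Read the real input but keep only ~0.1% of it; a fixed seed makes
      // the sample (and therefore any failure) reproducible between runs.
      val sample = spark.read
        .parquet("/path/to/input")   // placeholder input path
        .sample(withReplacement = false, fraction = 0.001, seed = 42L)

      // runPipeline stands in for your existing transformations; running
      // them unchanged on the sample hits runtime errors in minutes rather
      // than half an hour, and you can set IDE breakpoints while doing so.
      // runPipeline(sample).show()

      spark.stop()
    }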


Viacheslav Rodionov