
I am new to Spark and trying to find my way.

I have a Spark application that runs a complicated map function over a dataset. This map function can fail, mainly for data-related reasons. How can I get some meaningful information about what went wrong? I'm not sure where to start.

Many thanks!

– ThatDataGuy

2 Answers


If you want to write unit tests, you can generate a dataset from a collection, map over it using your map function, and test the collected result with your favourite test suite.
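For instance, a minimal sketch using PySpark and the standard unittest module; my_map_function here is a placeholder for your own logic, and Spark runs locally in-process:

import unittest
from pyspark.sql import SparkSession

def my_map_function(x):
    # placeholder for the real (complicated) map function
    return x * 2

class MapFunctionTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # local[2] runs Spark on the test machine with two worker threads
        cls.spark = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_map_function(self):
        rdd = self.spark.sparkContext.parallelize([1, 2, 3])
        self.assertEqual(rdd.map(my_map_function).collect(), [2, 4, 6])

if __name__ == "__main__":
    unittest.main()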

If you're using PySpark, I don't really know how you could attach a debugger to a Spark job, but with a distributed computing engine, interactive debugging is always a mess, so I don't think this path is worth exploring.

In my company we usually go for unit testing when we want to test the logic of a specific function, so you should be good to go.

This answer should cover what you need: How do I unit test PySpark programs?

– Chobeat
  • The code runs fine over specific test datasets. It's when I run over a large input dataset that specific values in there cause the logic to fail. I'd like to know what those values are. It might be only 1 in 1000 input sets that fail, so spot-checking some isn't helpful. I might just have bad values in the dataset. – ThatDataGuy Oct 06 '16 at 08:46
  • What does it mean "to fail"? If it throws an exception, try to catch it. If it produces wrong values, just return (input, output) instead of just (output) and watch (maybe with a filter) what kind of input produced the anomaly in the output. – Chobeat Oct 06 '16 at 08:51
  • Let's say it throws an exception and I catch it in the map function. What can I do with it then? Can I log it? Where do those logs go? – ThatDataGuy Oct 06 '16 at 11:10
  • It depends on your logging configuration. In any case, you can catch the exceptions and output them (bundled with the input) so you can find them in the result of your computation (see the sketch below). – Chobeat Oct 06 '16 at 12:28
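To make the pattern from these comments concrete, here is a minimal sketch; my_map_function and rdd are placeholders, not names from the original question. The map function catches the exception and bundles it with the input record, and the driver filters the result for failures:

import traceback

def map_with_input(record):
    try:
        return (record, my_map_function(record), None)
    except Exception:
        # traceback objects are not picklable, so ship a formatted string
        return (record, None, traceback.format_exc())

pairs = rdd.map(map_with_input)
failures = pairs.filter(lambda t: t[2] is not None)
for record, _, error in failures.take(10):   # inspect the offending inputs
    print(record, error)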

Ok, so this is indeed possible, but there are some pitfalls.

Broadly, create a class that encapsulates the results of your map function, e.g.

class CalcResult(object):
    """Holds the outcome of the map function for a single record."""

    def __init__(self):
        self.dataResult = None    # the computed value, when the calculation succeeds
        self.TraceBackStr = None  # the traceback, kept as a formatted string
        self.wasError = None      # set to True on failure so you can filter on it

Then you can test the wasError field in order to log exceptions.

The result cannot carry the traceback object itself, as traceback objects are not picklable. So, I suggest a formatted string.
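For illustration, a minimal sketch of how the map function might populate this wrapper and how the driver could pull the failures back; my_calculation and rdd are placeholders for your own logic and data:

import traceback

def safe_calc(record):
    result = CalcResult()
    try:
        result.dataResult = my_calculation(record)  # placeholder for your logic
        result.wasError = False
    except Exception:
        # keep a formatted string: traceback objects cannot be pickled
        result.TraceBackStr = traceback.format_exc()
        result.wasError = True
    return result

results = rdd.map(safe_calc).cache()
for failed in results.filter(lambda r: r.wasError).take(10):
    print(failed.TraceBackStr)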

– ThatDataGuy