Background:
I frequently write Java code on Apache Spark for ETL jobs, usually on a Cloudera CDH cluster with all the data sources already set up. The data I'm processing is usually dirty. For example, I assumed a zip code could be represented as an integer, but some records contain values like "85281-281" that can't be parsed as one. An exception is thrown and the program halts with a long stack trace.
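To make the failure mode concrete, here is a minimal sketch of the kind of job I mean (the file path and CSV layout are made up for illustration):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NaiveZipParse {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("naive-zip-parse"));

        // Assume the zip code is the first CSV column (hypothetical layout).
        JavaRDD<Integer> zips = sc.textFile("hdfs:///input/records.csv")
                .map(line -> Integer.parseInt(line.split(",")[0]));

        // A record like "85281-281" makes parseInt throw
        // NumberFormatException; the task fails, Spark retries and then
        // aborts the job, printing only a stack trace with code line numbers.
        System.out.println(zips.count());
    }
}
```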
Previously I had to write the code based on my assumptions, run it on the cluster, and watch it fail with a huge stack trace that includes the line number of the code that threw the exception. But it is really time consuming to find the root cause that way, especially since I don't know which specific line of the data triggered it.
So I'm thinking the following:
The code shouldn't stop when an error occurs. This part is easy: Java's exception handling can take care of it.
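For example, a sketch like this (same hypothetical path and layout as above) skips unparseable records instead of killing the job:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TolerantZipParse {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("tolerant-zip-parse"));

        JavaRDD<Integer> zips = sc.textFile("hdfs:///input/records.csv")
                .map(line -> {
                    try {
                        return Integer.parseInt(line.split(",")[0]);
                    } catch (NumberFormatException e) {
                        return null; // mark bad records instead of failing
                    }
                })
                .filter(zip -> zip != null); // drop the bad records

        System.out.println(zips.count());
    }
}
```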
I want to know the contents of the variables on the stack at the moment of the error, so that I can trace the root cause back to the input, i.e., the specific line of the input file. Not just the source line number that threw the exception. For a NullPointerException, for example, I want to see the raw input record and which line of the input file caused it.
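The closest thing I can do by hand today is tag every record with its position via zipWithIndex and keep the raw line next to the failure, roughly like this (again a sketch with made-up paths):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TracedZipParse {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("traced-zip-parse"));

        // zipWithIndex pairs each line with its 0-based position in the RDD.
        JavaPairRDD<String, Long> numbered =
                sc.textFile("hdfs:///input/records.csv").zipWithIndex();

        // Keep the position and raw text of every record that fails to
        // parse, so the dirty input can be located afterwards.
        JavaRDD<String> badRecords = numbered
                .filter(t -> {
                    try {
                        Integer.parseInt(t._1.split(",")[0]);
                        return false;
                    } catch (NumberFormatException e) {
                        return true;
                    }
                })
                .map(t -> "line " + t._2 + ": " + t._1);

        badRecords.saveAsTextFile("hdfs:///output/bad-records");
    }
}
```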
We have e.printStackTrace() to show all the function calls on the stack. Can we do better by also showing the contents of the variables in the frame at the top of the stack, just like a debugger does?
Of course I could print out all the variables by hand. I just want to know if there is a specific function, like the one a debugger would use, to show those variables.
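In other words, the hand-coded version looks something like this (the field names are just examples), and I'd like to avoid writing it around every parse:

```java
public class ManualContext {
    // A hand-rolled version of what a debugger shows automatically:
    // copy the relevant locals into the exception message before rethrowing.
    static int parseZip(String rawLine, long lineNo) {
        String field = rawLine.split(",")[0];
        try {
            return Integer.parseInt(field);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "bad zip at input line " + lineNo
                    + ": field=\"" + field + "\""
                    + ", raw=\"" + rawLine + "\"", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseZip("85281,foo", 1));     // parses fine
        System.out.println(parseZip("85281-281,bar", 2)); // throws with context
    }
}
```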