In anticipation of having to debug our Python code by reading error messages in the log files, I have created a Hadoop Streaming job that deliberately throws an exception, but I cannot locate the error message (or the stack trace) anywhere in the logs.
Similar questions, hadoop streaming: where are application logs? and hadoop streaming: how to see application logs?, use Python's logging module, which is not desirable here: the interpreter already prints the traceback of an uncaught exception to stderr, so we should not have to log it ourselves.
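For context, the approach from those questions amounts to something like the sketch below (the logger setup and the message are illustrative, not our code); it works because Hadoop Streaming captures each task attempt's stderr, but it duplicates what the interpreter already does when an exception goes uncaught:

#!/usr/bin/python
# Sketch of the logging-based approach from the linked questions.
# An uncaught exception already sends its full traceback to stderr,
# which is why we would rather not duplicate it like this.
import sys
import logging

logging.basicConfig(stream=sys.stderr, level=logging.INFO)

def main():
    try:
        1 / 0  # stand-in for the real mapper work
    except ZeroDivisionError:
        logging.exception("mapper failed")  # writes the traceback to stderr
        raise

if __name__ == "__main__":
    main()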
Here is the mapper code; for the reduce step we use Hadoop Streaming's built-in aggregate reducer.
#!/usr/bin/python
import sys, re
import random

def main(argv):
    line = sys.stdin.readline()
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            for word in pattern.findall(line):
                # The "LongValueSum:" prefix tells the built-in
                # aggregate reducer to sum the counts per word.
                print "LongValueSum:" + word.lower() + "\t" + "1"
            # randint(0, 99) includes 0, so this raises
            # ZeroDivisionError on roughly 1 in 100 input lines.
            x = 1 / random.randint(0, 99)
            line = sys.stdin.readline()
    except EOFError:  # readline() returns "" at EOF, so this is only a guard
        return None

if __name__ == "__main__":
    main(sys.argv)
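Run locally (assuming the script is saved as mapper.py and marked executable), the mapper produces the expected output, and whenever randint does return 0 the ZeroDivisionError traceback is printed straight to the terminal on stderr:

$ echo "foo foo bar" | ./mapper.py
LongValueSum:foo	1
LongValueSum:foo	1
LongValueSum:bar	1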
The x = 1 / random.randint(0,99) line is supposed to raise a ZeroDivisionError (randint(0,99) includes 0, so roughly one input line in a hundred triggers it), and indeed the job fails, but grepping the log files does not turn up the error. Is there a special flag we need to set somewhere?
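For reference, these are the kinds of searches we ran; the log directory is an assumption based on common defaults for YARN's NodeManager container logs, and the application ID is a placeholder:

# On a worker node: search the NodeManager container logs.
grep -R "ZeroDivisionError" /var/log/hadoop-yarn/userlogs/

# Fetch the aggregated logs for the failed application (placeholder ID).
yarn logs -applicationId application_1234567890123_0001 | grep -B 2 -A 10 Traceback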
We have gone through the Google Dataproc documentation as well as the Hadoop Streaming documentation.