I am getting the error below when my Python script is executed via Hadoop MapReduce, but it works fine when the script is run locally (without MapReduce).
When I remove the line below from the script, the MapReduce job works fine:
df_notnull = df[df[column].notnull()].copy()
Please advise: what am I doing wrong?
Script:
#!/usr/local/bin/python3
import sys

import pandas as pd

column = 'first_name'
separator_in = '|'
separator_out = '|'
compression_in = None
compression_out = None

# Read pipe-separated input from stdin, drop rows where `column` is null,
# and write the result to stdout.
df = pd.read_csv(sys.stdin, compression=compression_in, sep=separator_in, low_memory=False, encoding='utf-8-sig')
df_notnull = df[df[column].notnull()].copy()
df_notnull.to_csv(sys.stdout, compression=compression_out, index=False, sep=separator_out)
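For reference (not part of the original script), here is a minimal local sketch of the filtering step that fails fast with a clearer message when the column is absent. One plausible cause of the non-zero exit under Streaming, though this is an assumption and not confirmed by the logs, is a `KeyError`: a mapper can receive an input split that does not contain the header row, so `df[column]` would fail on that task.

```python
import io

import pandas as pd

def filter_notnull(df, column):
    """Return rows where `column` is non-null, raising a clear error if absent."""
    if column not in df.columns:
        # Under Hadoop Streaming, a mapper may get a split without the header
        # row, in which case `column` would not exist (assumption).
        raise KeyError(f"column {column!r} not found; got {list(df.columns)}")
    return df[df[column].notnull()].copy()

# Tiny in-memory check mirroring the pipe-separated input format
sample = io.StringIO("first_name|last_name\nAda|Lovelace\n|Hopper\n")
out = filter_notnull(pd.read_csv(sample, sep='|'), 'first_name')
```

Running the original script locally against the same input, e.g. `cat python_test.csv | python3 pythonScript.py`, would also reveal the Python traceback that Streaming hides behind "subprocess failed with code 1".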
Submit Command:
hadoop jar /usr/env/hadoop-mapreduce-client/hadoop-streaming.jar -D mapreduce.job.queuename=QueueName -D mapreduce.job.reduces=1 -file /file/path/pythonScript.py -mapper 'python3 pythonScript.py' -input /file/path/python_test.csv -output /file/path/ouput001/out_17
Error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)