
I am getting the below error when my Python script is executed using Hadoop MapReduce, but it works fine when the script is executed locally (not using MapReduce).

When I remove the below line from the script, MapReduce works fine:

df_notnull = df[df[column].notnull()].copy()

Please advise what I am doing wrong.

Script:

#!/usr/local/bin/python3

import sys

import pandas as pd

column = 'first_name'
separator_in = '|'
separator_out = '|'
compression_in = None
compression_out = None

# Read the input split from stdin, drop rows where the column is null,
# and write the result to stdout.
df = pd.read_csv(sys.stdin, compression=compression_in, sep=separator_in, low_memory=False, encoding='utf-8-sig')
df_notnull = df[df[column].notnull()].copy()
df_notnull.to_csv(sys.stdout, compression=compression_out, index=False, sep=separator_out)
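When a streaming mapper dies, Hadoop only reports "subprocess failed with code 1"; the actual Python traceback ends up in the task's stderr log in YARN. One way to make that traceback easy to find is to wrap the script body in a try/except that prints the exception to stderr before exiting nonzero. The sketch below (with the logic pulled into a hypothetical `filter_notnull` function) is an assumption about how the script could be restructured, not the asker's actual code:

```python
#!/usr/local/bin/python3
# Defensive variant of the mapper: any exception is written to stderr
# (captured in the YARN task logs) before exiting with a nonzero code,
# so the logs show the real traceback instead of only
# "subprocess failed with code 1".
import sys
import traceback

import pandas as pd

COLUMN = 'first_name'
SEPARATOR = '|'


def filter_notnull(in_stream, out_stream):
    # Same logic as the original script: drop rows where COLUMN is null.
    df = pd.read_csv(in_stream, sep=SEPARATOR, low_memory=False)
    df_notnull = df[df[COLUMN].notnull()].copy()
    df_notnull.to_csv(out_stream, index=False, sep=SEPARATOR)


if __name__ == '__main__':
    try:
        filter_notnull(sys.stdin, sys.stdout)
    except Exception:
        traceback.print_exc(file=sys.stderr)
        sys.exit(1)
```

This does not fix the underlying failure, but it makes the container logs show exactly which line raised (for example, an `ImportError` if pandas is missing on a node, or a `KeyError` if the column name is not in the header).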

Submit Command:

hadoop jar /usr/env/hadoop-mapreduce-client/hadoop-streaming.jar -D mapreduce.job.queuename=QueueName -D mapreduce.job.reduces=1 -file /file/path/pythonScript.py -mapper 'python3 pythonScript.py' -input /file/path/python_test.csv -output /file/path/ouput001/out_17

Error:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

st_bones
  • Does this help? https://stackoverflow.com/questions/17037300/running-a-job-using-hadoop-streaming-and-mrjob-pipemapred-waitoutputthreads – user1558604 Dec 11 '19 at 23:37
  • You need to go to the actual container logs in YARN to see the real error. Chances are, pandas is not installed on every machine that your job runs on. – OneCricketeer Dec 12 '19 at 02:46
  • It works fine when I remove the below line from the script. I believe there is a syntax error that isn't supported by MapReduce. Please advise: df_notnull = df[df[column].notnull()].copy() – st_bones Dec 12 '19 at 12:23
  • MapReduce just runs the code on a remote machine as-is. If there were a syntax issue, the script would fail locally as well. If pandas isn't available on the other server, obviously it would fail... But again, you must open the YARN UI to see the real errors – OneCricketeer Dec 13 '19 at 03:39
  • In any case, you really shouldn't be using pandas here, but rather Spark if you want dataframes on Hadoop – OneCricketeer Dec 13 '19 at 03:41
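Following the comment that pandas may simply not be installed on every worker node, the filtering step itself needs nothing beyond the standard library. A stdlib-only sketch of the same mapper (an assumed rewrite using the `csv` module, not code from the question) would run on nodes without pandas:

```python
# Stdlib-only rewrite of the mapper, assuming the goal is just to drop
# rows whose 'first_name' field is empty. It reads '|'-separated rows
# from one stream and writes the surviving rows to another.
import csv
import sys

COLUMN = 'first_name'
SEPARATOR = '|'


def filter_rows(in_stream, out_stream):
    reader = csv.DictReader(in_stream, delimiter=SEPARATOR)
    writer = csv.DictWriter(out_stream, fieldnames=reader.fieldnames,
                            delimiter=SEPARATOR, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        # Keep only rows where the column is present and non-empty
        # (csv has no NaN, so "null" here means an empty string).
        if row.get(COLUMN):
            writer.writerow(row)


if __name__ == '__main__':
    filter_rows(sys.stdin, sys.stdout)
```

Note this streams row by row instead of loading the whole split into memory, and it treats an empty string as null, which is slightly stricter than pandas' `notnull()`.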

0 Answers