
I'm new to Hadoop, and I'm running a MapReduce job to count the revenue of different stores. The mapper and reducer programs work perfectly on their own, and I double-checked the files and the directories.

When I run the MapReduce command, which is:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce1/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /home/anwarvic \
  -output /joboutput

it gives the following output:

17/04/30 05:48:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/30 05:48:14 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
packageJobJar: [mapper.py, reducer.py] [] /tmp/streamjob7598928362555913238.jar tmpDir=null
17/04/30 05:48:15 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 
17/04/30 05:48:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/04/30 05:48:21 INFO mapred.FileInputFormat: Total input paths to process : 5
17/04/30 05:48:21 INFO net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
17/04/30 05:48:24 INFO mapreduce.JobSubmitter: number of splits:6
17/04/30 05:48:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1493523215757_0002
17/04/30 05:48:27 INFO impl.YarnClientImpl: Submitted application application_1493523215757_0002
17/04/30 05:48:28 INFO mapreduce.Job: The url to track the job: http://anwar-computer:8088/proxy/application_1493523215757_0002/
17/04/30 05:48:28 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
17/04/30 05:48:28 INFO streaming.StreamJob: Running job: job_1493523215757_0002
17/04/30 05:48:28 INFO streaming.StreamJob: Job running in-process (local Hadoop)
17/04/30 05:48:29 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:49:08 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:49:10 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:49:41 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:49:42 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:49:43 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:49:45 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:50:07 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:50:08 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:50:37 INFO streaming.StreamJob:  map 100%  reduce 100%
17/04/30 05:50:41 INFO streaming.StreamJob: Job running in-process (local Hadoop)
17/04/30 05:50:41 ERROR streaming.StreamJob: Job not successful. Error: Task failed task_1493523215757_0002_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

17/04/30 05:50:41 INFO streaming.StreamJob: killJob...
17/04/30 05:50:41 INFO impl.YarnClientImpl: Killed application application_1493523215757_0002
Streaming Command Failed!

The output basically says that the job was not successful, even though the map and reduce phases both reached 100%.

As stated in this answer and this one, I added the shebang header to both the mapper.py and reducer.py files:

#!/usr/bin/env python

By the way, this answer didn't work for me!
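For reference, the shebang has to be the very first line of the script, and the script itself must be executable (chmod +x mapper.py reducer.py). Here is a minimal sketch of a streaming mapper with that header (the pass-through body is just a placeholder, not my actual revenue logic):

#!/usr/bin/env python
# The shebang above must be the very first line of the file
import sys

# Identity mapper: pass each input line through unchanged
for line in sys.stdin:
    print(line.strip())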

I've been stuck on this problem for around 20 hours, so any help would be much appreciated.


1 Answer


I would recommend the following steps:

  1. Visit your job tracker URL: http://anwar-computer:8088/proxy/application_1493523215757_0002/
  2. Go to the failed mappers (your log says that a mapper failed: Job failed as tasks failed. failedMaps:1 failedReduces:0). There you can see the exception trace.
  3. For more detailed logs, follow the logs link available on the failed mapper's page.

Analyze the logs; most likely they will point you to the root cause.
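If the web UI is not reachable, the same logs can usually be pulled from the command line, assuming YARN log aggregation is enabled on the cluster:

yarn logs -applicationId application_1493523215757_0002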

Possible root causes:

  1. Your data might be malformed or different from what the mapper expects (a defensive-parsing sketch follows the log excerpt below).
  2. Another possible reason is the data size versus the memory available on the node: if the input is compressed, decompressing it during the map step may exhaust memory.

I suspect point 1 is the reason, as the framework has tried to run the mapper multiple times and it kept failing:

17/04/30 05:48:29 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:49:08 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:49:10 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:49:41 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:49:42 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:49:43 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:49:45 INFO streaming.StreamJob:  map 0%  reduce 0%
17/04/30 05:50:07 INFO streaming.StreamJob:  map 17%  reduce 0%
17/04/30 05:50:08 INFO streaming.StreamJob:  map 0%  reduce 0%
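If malformed input turns out to be the cause, a common fix is to make the mapper defensive so that one bad record does not kill the whole task. A minimal sketch, assuming tab-separated store/revenue records (adapt the parsing to your actual input format):

#!/usr/bin/env python
# Defensive streaming mapper: skip malformed records instead of crashing
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    try:
        store, revenue = fields[0], fields[1]
        float(revenue)  # validate that the revenue field is numeric
    except (IndexError, ValueError):
        continue  # malformed record: skip it rather than fail the task
    print('%s\t%s' % (store, revenue))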

Furthermore, you can add more logging to your mapper to get better details.
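In a streaming job, anything the script writes to stderr ends up in the task logs, and Hadoop Streaming additionally treats stderr lines of the form reporter:counter:<group>,<counter>,<amount> as counter updates. A small sketch of both (the RevenueJob group name is just an example):

import sys

def debug(msg):
    # stderr output from a streaming task is captured in the task logs
    sys.stderr.write('DEBUG: %s\n' % msg)

def count_bad_record():
    # Hadoop Streaming parses this stderr line as a counter increment
    sys.stderr.write('reporter:counter:RevenueJob,BadRecords,1\n')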

Alternatively, you can enable debug logging by adding the --loglevel DEBUG parameter to the hadoop command, e.g.:

hadoop \
  --loglevel DEBUG \
  jar /usr/local/hadoop/share/hadoop/mapreduce1/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /home/anwarvic \
  -output /joboutput

Reference: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/CommandsManual.html
