
I installed Cloudera Manager (CDH 5) and created my own cluster. Everything works, but when I run my task it runs slowly (18 minutes). An equivalent Ruby script runs in about 5 seconds.

My task consists of:

#mapper.py 
import sys 

def do_map(doc): 
    for word in doc.split(): 
        yield word.lower(), 1 

for line in sys.stdin: 
    for key, value in do_map(line): 
        print(key + "\t" + str(value)) 

and

#reducer.py 
import sys 

def do_reduce(word, values): 
    return word, sum(values) 

prev_key = None 
values = [] 

for line in sys.stdin: 
    key, value = line.split("\t") 
    if key != prev_key and prev_key is not None: 
        result_key, result_value = do_reduce(prev_key, values) 
        print(result_key + "\t" + str(result_value)) 
        values = [] 
    prev_key = key 
    values.append(int(value)) 

if prev_key is not None: 
    result_key, result_value = do_reduce(prev_key, values) 
    print(result_key + "\t" + str(result_value)) 
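The two scripts can be checked locally before blaming the cluster. Below is a minimal sketch (the sample input lines are made up) that simulates the streaming pipeline in plain Python: map, then sort by key as Hadoop's shuffle does, then the same group-by-consecutive-key reduce as `reducer.py`:

```python
# Local simulation of the streaming pipeline: map -> shuffle (sort) -> reduce.
def do_map(doc):
    for word in doc.split():
        yield word.lower(), 1

def do_reduce(word, values):
    return word, sum(values)

def run_pipeline(lines):
    # Map phase: emit (word, 1) pairs for every input line.
    pairs = [kv for line in lines for kv in do_map(line)]
    # Shuffle phase: Hadoop sorts mapper output by key before reducing.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: group runs of equal keys, exactly as reducer.py does.
    results = []
    prev_key, values = None, []
    for key, value in pairs:
        if key != prev_key and prev_key is not None:
            results.append(do_reduce(prev_key, values))
            values = []
        prev_key = key
        values.append(value)
    if prev_key is not None:
        results.append(do_reduce(prev_key, values))
    return dict(results)

counts = run_pipeline(["Hello world", "hello Hadoop"])
print(counts)  # {'hadoop': 1, 'hello': 2, 'world': 1}
```

If this runs instantly on your data, the logic is fine and the 18 minutes are pure job overhead.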

I run my task with this command:

yarn jar hadoop-streaming.jar -input lenta_articles -output lenta_wordcount -file mapper.py -file reducer.py -mapper "python mapper.py" -reducer "python reducer.py"

log of run command:

15/11/17 10:14:27 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py] [/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hadoop-streaming-2.6.0-cdh5.4.8.jar] /tmp/streamjob8334226755199432389.jar tmpDir=null
15/11/17 10:14:29 INFO client.RMProxy: Connecting to ResourceManager at manager/10.128.181.136:8032
15/11/17 10:14:29 INFO client.RMProxy: Connecting to ResourceManager at manager/10.128.181.136:8032
15/11/17 10:14:31 INFO mapred.FileInputFormat: Total input paths to process : 909
15/11/17 10:14:32 INFO mapreduce.JobSubmitter: number of splits:909
15/11/17 10:14:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1447762910705_0010
15/11/17 10:14:32 INFO impl.YarnClientImpl: Submitted application application_1447762910705_0010
15/11/17 10:14:32 INFO mapreduce.Job: The url to track the job: http://manager:8088/proxy/application_1447762910705_0010/
15/11/17 10:14:32 INFO mapreduce.Job: Running job: job_1447762910705_0010
15/11/17 10:14:49 INFO mapreduce.Job: Job job_1447762910705_0010 running in uber mode : false
15/11/17 10:14:49 INFO mapreduce.Job:  map 0% reduce 0%
15/11/17 10:16:04 INFO mapreduce.Job:  map 1% reduce 0%

The lenta_articles input folder is 2.5 MB and consists of 909 files; the average file size is 3 KB.

Ask if you need more information or want me to run any command.

What am I doing wrong?

1 Answer


Hadoop is not efficient at handling a large number of small files, but it is efficient at processing a small number of large files.

Since you are already using Cloudera, have a look at the alternatives for improving performance with a large number of small files on Hadoop, as described in the Cloudera article.

The main reason for slow processing:

Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.

The more files you have, the more mappers you need to read and process the data. Thousands of mappers processing small files and passing their output to reducers over the network will degrade performance.

Passing the input as sequence files with LZO compression is one of the best alternatives for handling a large number of small files. Have a look at SE Question 1 and Other Alternative.

There are some other alternatives (some are not related to Python), but you should look at this article:

Change the ingestion process/interval 
Batch file consolidation 
Sequence files 
HBase 
S3DistCp (If using Amazon EMR) 
Using a CombineFileInputFormat 
Hive configuration settings 
Using Hadoop’s append capabilities 
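Of the options above, batch file consolidation is the simplest to try first. A minimal sketch (paths and file names are hypothetical) that merges many small text files into one large file before submitting the job, so the job sees a handful of splits instead of 909:

```python
# Sketch of "batch file consolidation": merge many small input files into one
# large file, so the streaming job gets one split instead of hundreds.
import os
import tempfile

def consolidate(input_dir, output_path):
    """Concatenate every file in input_dir into one large file."""
    with open(output_path, "w") as out:
        for name in sorted(os.listdir(input_dir)):
            with open(os.path.join(input_dir, name)) as f:
                for line in f:
                    # Normalize line endings so records never run together.
                    out.write(line if line.endswith("\n") else line + "\n")

# Demo with a temporary directory standing in for lenta_articles.
src = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src, "article%d.txt" % i), "w") as f:
        f.write("word%d\n" % i)

# Write the merged file outside src so it is not consolidated into itself.
merged = os.path.join(tempfile.mkdtemp(), "merged.txt")
consolidate(src, merged)
```

After consolidating locally, you would upload the merged file to HDFS (e.g. with `hdfs dfs -put`) and point `-input` at it instead of the directory of small files. Note that for a word count the file boundaries carry no information, so plain concatenation loses nothing; if you need to keep track of which article a line came from, prefix each line with the file name instead.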
Ravindra babu