Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
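The mapper and reducer named in these commands are ordinary scripts that read records on standard input and write tab-separated key/value lines on standard output. A minimal sketch of that contract in Python (the simplified "year temperature" input format below is an assumption for illustration, not the actual scripts from the examples above):

```python
def map_records(lines):
    """Mapper: turn each "year temperature" record into a "year\ttemp" pair."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:                  # skip blank or malformed records
            year, temp = fields
            yield f"{year}\t{temp}"

def reduce_records(lines):
    """Reducer: input lines arrive grouped by key; keep the maximum per year."""
    best = {}
    for line in lines:
        year, temp = line.rstrip("\n").split("\t")
        t = int(temp)
        if year not in best or t > best[year]:
            best[year] = t
    for year in sorted(best):
        yield f"{year}\t{best[year]}"

# In a real streaming job each function runs in its own script, fed by
# sys.stdin, e.g.:  for out in map_records(sys.stdin): print(out)
```

Hadoop runs the mapper and reducer as separate processes, sorting and grouping the mapper's output by key before it reaches the reducer; the functions above only model what each script sees on its standard streams.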
871 questions
48 votes, 2 answers

Getting the count of records in a data frame quickly

I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.
thunderhemu • 492
20 votes, 2 answers

Importing text file : No Columns to parse from file

I am trying to take input from sys.stdin. This is a map reducer program for hadoop. Input file is in txt form. Preview of the data set: 196 242 3 881250949 186 302 3 891717742 22 377 1 878887116 244 51 2 880606923 166 346 1 …
mezz • 427
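For context on this one: pandas raises "No columns to parse from file" (an `EmptyDataError`) when `read_csv` is handed empty input, which is easy to trigger by running a streaming mapper by hand without piping anything into it. One way to sidestep it is to parse standard input directly; a sketch, assuming the four whitespace-separated fields shown in the preview (the user/item/rating/timestamp interpretation is a guess):

```python
def parse_ratings(lines):
    """Parse whitespace-separated records like "196 242 3 881250949".

    Blank or malformed lines are skipped rather than raising, so the script
    behaves the same whether it gets a piped file, streaming input, or
    nothing at all.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4:
            rows.append(tuple(int(f) for f in fields))
    return rows

# Typical streaming usage:  rows = parse_ratings(sys.stdin)
```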
20 votes, 2 answers

R install packages from Shell

I am trying to implement a reducer for Hadoop Streaming using R. However, I need to figure out a way to access certain libraries that are not built in R, dplyr..etc. Based on my research seems like there are two approaches: (1) In the reducer code,…
B.Mr.W. • 18,910
14 votes, 1 answer

How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API?

Amazon EMR Documentation to add steps to cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, Amazon EMR Documentation for Step configuration suggests that a single step can accommodate just one execution of…
verve • 775
13 votes, 1 answer

stateful and stateless streaming processing

While starting to learn streaming processing, I hear the following two technical items: stateful streaming processing, and stateless streaming processing, what are the difference between them? I heard storm is stateless while storm trident is…
user785099 • 5,323
13 votes, 5 answers

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Hey I'm fairly new to the world of Big Data. I came across this tutorial on http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob both locally and on…
Kiran Karanth • 133
12 votes, 3 answers

How to import a custom module in a MapReduce job?

I have a MapReduce job defined in main.py, which imports the lib module from lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py …
ffriend • 27,562
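The usual resolution for this class of problem: `-files lib.py,main.py` copies both files into each task's working directory, so a plain `import lib` in main.py works as long as that directory is on the module search path. A defensive sketch of the common workaround (it is not guaranteed to be necessary on every cluster):

```python
import os
import sys

def prepend_script_dir():
    """Ensure the directory holding this script (where files shipped with
    -files are copied) is searched first when importing sibling modules."""
    if "__file__" in globals():
        here = os.path.abspath(os.path.dirname(__file__) or ".")
    else:
        here = os.getcwd()
    if here not in sys.path:
        sys.path.insert(0, here)
    return here

prepend_script_dir()
# After this call, `import lib` can find the lib.py shipped via -files.
```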
12 votes, 1 answer

Hadoop streaming with C# and Mono : IdentityMapper being used incorrectly

I have mapper and reducer executables written in C#. I want to use these with Hadoop streaming. This is the command I'm using to create the Hadoop job... hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -input…
user1793093 • 129
11 votes, 4 answers

How do I pass a parameter to a python Hadoop streaming job?

For a python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves different based on the parameter being passed in? I understand that streaming jobs are called in the format of: hadoop jar…
zzztimbo • 2,293
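For reference, a commonly accepted approach here is the streaming option `-cmdenv NAME=VALUE`, which exports an environment variable into every mapper and reducer task; the script then reads it from its environment instead of from argv. A sketch (the `MIN_COUNT` variable name is invented for illustration):

```python
import os

def get_min_count(default=0):
    """Read a job parameter that the launcher exported with, e.g.:

        -cmdenv MIN_COUNT=5

    Streaming places it in each task's environment before the script runs.
    """
    return int(os.environ.get("MIN_COUNT", default))
```

The reducer can then branch on `get_min_count()` to behave differently per job, without changing how the script is invoked on each node.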
10 votes, 7 answers

Hadoop Java Error : Exception in thread "main" java.lang.NoClassDefFoundError: WordCount (wrong name: org/myorg/WordCount)

I am new to hadoop. I followed the Michael Noll tutorial to set up hadoop in a single node. I tried running the WordCount program. This is the code I used: import java.io.IOException; import java.util.StringTokenizer; import…
Aswin Alagappan • 173
10 votes, 5 answers

Are there any distributed machine learning libraries for using Python with Hadoop?

I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past and I do not know Java. As far as I can tell there are no well…
iRoygbiv • 865
9 votes, 1 answer

Amazon MapReduce best practices for logs analysis

I'm parsing access logs generated by Apache, Nginx, Darwin (video streaming server) and aggregating statistics for each delivered file by date / referrer / useragent. Tons of logs generated every hour and that number likely to be increased…
webdevbyjoss • 504
9 votes, 3 answers

Hadoop: Error: java.lang.RuntimeException: Error in configuring object

I have Hadoop installed and working perfectly because I run the word count example and it works great. Now I tried to move forward and do some more real examples. My example is done in this website as Example 2 (Average Salaries by each department)…
muazfaiz • 4,611
9 votes, 2 answers

Hadoop is not showing my job in the job tracker even though it is running

Problem: When I submit a job to my hadoop 2.2.0 cluster it doesn't show up in the job tracker but the job completes successfully. By this I can see the output and it is running correctly and prints output as it is running. I have tried multiple…
Chris Hinshaw • 6,967
9 votes, 3 answers

Hadoop streaming - remove trailing tab from reducer output

I have a hadoop streaming job whose output does not contain key/value pairs. You can think of it as value-only pairs or key-only pairs. My streaming reducer (a php script) is outputting records separated by newlines. Hadoop streaming treats this as…
Eddified • 3,085
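Background on why the tab appears: streaming splits each line the script emits at the first tab into a key and a value, so a line with no tab becomes a key with an empty value, and the text output format then writes key, separator, value, producing "key\t". A small Python model of that behavior (this imitates the Java logic for illustration; it is not the actual implementation):

```python
def split_streaming_line(line, separator="\t"):
    """Model of how streaming splits a script's output line: everything up
    to the first separator is the key, the rest is the value."""
    key, _sep, value = line.partition(separator)
    return key, value

def render_text_output(key, value, separator="\t"):
    """Model of how the text output format writes a record; with an empty
    value this is where the trailing tab comes from."""
    return f"{key}{separator}{value}"
```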