Questions tagged [hadoop-streaming]

Hadoop streaming is a utility that allows running map-reduce jobs using any executable that reads from standard input and writes to standard output.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer; the script only needs to read from standard input and write to standard output.

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Ruby Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb

Python Example:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/python/max_temperature_map.py \
-reducer ch02/src/main/python/max_temperature_reduce.py
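The mapper and reducer named in these commands are ordinary scripts that read records on standard input and write tab-separated key/value lines on standard output. A minimal sketch of that contract in Python (the simplified "year temperature" input format below is an assumption for illustration, not the actual scripts from the examples above):

```python
def map_records(lines):
    """Mapper: turn each "year temperature" record into a "year\ttemp" pair."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:                  # skip blank or malformed records
            year, temp = fields
            yield f"{year}\t{temp}"

def reduce_records(lines):
    """Reducer: input lines arrive grouped by key; keep the maximum per year."""
    best = {}
    for line in lines:
        year, temp = line.rstrip("\n").split("\t")
        t = int(temp)
        if year not in best or t > best[year]:
            best[year] = t
    for year in sorted(best):
        yield f"{year}\t{best[year]}"

# In a real streaming job each function runs in its own script, fed by
# sys.stdin, e.g.:  for out in map_records(sys.stdin): print(out)
```

Hadoop runs the mapper and reducer as separate processes, sorting and grouping the mapper's output by key before it reaches the reducer; the functions above only model what each script sees on its standard streams.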
871 questions
48 votes, 2 answers

Getting the count of records in a data frame quickly

I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.
thunderhemu • 492
20 votes, 2 answers

Importing text file : No Columns to parse from file

I am trying to take input from sys.stdin. This is a map reducer program for hadoop. Input file is in txt form. Preview of the data set: 196 242 3 881250949 186 302 3 891717742 22 377 1 878887116 244 51 2 880606923 166 346 1 …
mezz • 427
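For context on this one: pandas raises "No columns to parse from file" (an `EmptyDataError`) when `read_csv` is handed empty input, which is easy to trigger by running a streaming mapper by hand without piping anything into it. One way to sidestep it is to parse standard input directly; a sketch, assuming the four whitespace-separated fields shown in the preview (the user/item/rating/timestamp interpretation is a guess):

```python
def parse_ratings(lines):
    """Parse whitespace-separated records like "196 242 3 881250949".

    Blank or malformed lines are skipped rather than raising, so the script
    behaves the same whether it gets a piped file, streaming input, or
    nothing at all.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4:
            rows.append(tuple(int(f) for f in fields))
    return rows

# Typical streaming usage:  rows = parse_ratings(sys.stdin)
```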
20 votes, 2 answers

R install packages from Shell

I am trying to implement a reducer for Hadoop Streaming using R. However, I need to figure out a way to access certain libraries that are not built in R, dplyr..etc. Based on my research seems like there are two approaches: (1) In the reducer code,…
B.Mr.W. • 18,910
14 votes, 1 answer

How do I submit more than one job to Hadoop in a step using the Elastic MapReduce API?

Amazon EMR Documentation to add steps to cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, Amazon EMR Documentation for Step configuration suggests that a single step can accommodate just one execution of…
verve • 775
13 votes, 1 answer

stateful and stateless streaming processing

While starting to learn streaming processing, I hear the following two technical items: stateful streaming processing, and stateless streaming processing, what are the difference between them? I heard storm is stateless while storm trident is…
user785099 • 5,323
13 votes, 5 answers

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Hey I'm fairly new to the world of Big Data. I came across this tutorial on http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run a MapReduce job using mrjob both locally and on…
Kiran Karanth • 133
12 votes, 3 answers

How to import a custom module in a MapReduce job?

I have a MapReduce job defined in main.py, which imports the lib module from lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py …
ffriend • 27,562
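The usual resolution for this class of problem: `-files lib.py,main.py` copies both files into each task's working directory, so a plain `import lib` in main.py works as long as that directory is on the module search path. A defensive sketch of the common workaround (it is not guaranteed to be necessary on every cluster):

```python
import os
import sys

def prepend_script_dir():
    """Ensure the directory holding this script (where files shipped with
    -files are copied) is searched first when importing sibling modules."""
    if "__file__" in globals():
        here = os.path.abspath(os.path.dirname(__file__) or ".")
    else:
        here = os.getcwd()
    if here not in sys.path:
        sys.path.insert(0, here)
    return here

prepend_script_dir()
# After this call, `import lib` can find the lib.py shipped via -files.
```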
12 votes, 1 answer

Hadoop streaming with C# and Mono : IdentityMapper being used incorrectly

I have mapper and reducer executables written in C#. I want to use these with Hadoop streaming. This is the command I'm using to create the Hadoop job... hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -input…
user1793093 • 129
11 votes, 4 answers

How do I pass a parameter to a python Hadoop streaming job?

For a python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves different based on the parameter being passed in? I understand that streaming jobs are called in the format of: hadoop jar…
zzztimbo • 2,293
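For reference, a commonly accepted approach here is the streaming option `-cmdenv NAME=VALUE`, which exports an environment variable into every mapper and reducer task; the script then reads it from its environment instead of from argv. A sketch (the `MIN_COUNT` variable name is invented for illustration):

```python
import os

def get_min_count(default=0):
    """Read a job parameter that the launcher exported with, e.g.:

        -cmdenv MIN_COUNT=5

    Streaming places it in each task's environment before the script runs.
    """
    return int(os.environ.get("MIN_COUNT", default))
```

The reducer can then branch on `get_min_count()` to behave differently per job, without changing how the script is invoked on each node.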
10 votes, 7 answers

Hadoop Java Error : Exception in thread "main" java.lang.NoClassDefFoundError: WordCount (wrong name: org/myorg/WordCount)

I am new to hadoop. I followed the Michael Noll tutorial to set up hadoop in a single node. I tried running the WordCount program. This is the code I used: import java.io.IOException; import java.util.StringTokenizer; import…
Aswin Alagappan • 173
10 votes, 5 answers

Are there any distributed machine learning libraries for using Python with Hadoop?

I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past and I do not know Java. As far as I can tell there are no well…
iRoygbiv • 865
9 votes, 1 answer

Amazon MapReduce best practices for logs analysis

I'm parsing access logs generated by Apache, Nginx, Darwin (video streaming server) and aggregating statistics for each delivered file by date / referrer / useragent. Tons of logs generated every hour and that number likely to be increased…
webdevbyjoss • 504
9 votes, 3 answers

Hadoop: Error: java.lang.RuntimeException: Error in configuring object

I have Hadoop installed and working perfectly because I run the word count example and it works great. Now I tried to move forward and do some more real examples. My example is done in this website as Example 2 (Average Salaries by each department)…
muazfaiz • 4,611
9 votes, 2 answers

Hadoop is not showing my job in the job tracker even though it is running

Problem: When I submit a job to my hadoop 2.2.0 cluster it doesn't show up in the job tracker but the job completes successfully. By this I can see the output and it is running correctly and prints output as it is running. I have tried multiple…
Chris Hinshaw • 6,967
9 votes, 3 answers

Hadoop streaming - remove trailing tab from reducer output

I have a hadoop streaming job whose output does not contain key/value pairs. You can think of it as value-only pairs or key-only pairs. My streaming reducer (a php script) is outputting records separated by newlines. Hadoop streaming treats this as…
Eddified • 3,085
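Background on why the tab appears: streaming splits each line the script emits at the first tab into a key and a value, so a line with no tab becomes a key with an empty value, and the text output format then writes key, separator, value, producing "key\t". A small Python model of that behavior (this imitates the Java logic for illustration; it is not the actual implementation):

```python
def split_streaming_line(line, separator="\t"):
    """Model of how streaming splits a script's output line: everything up
    to the first separator is the key, the rest is the value."""
    key, _sep, value = line.partition(separator)
    return key, value

def render_text_output(key, value, separator="\t"):
    """Model of how the text output format writes a record; with an empty
    value this is where the trailing tab comes from."""
    return f"{key}{separator}{value}"
```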