I have a project requirement. I'm using a Python script to analyze the data. Initially, I used txt files as input to that Python script, but as the data grows I have to switch my storage platform to Hadoop HDFS. How can I provide HDFS data to the Python script as input? Is there any way? Thanks in advance.
2

- Use Hadoop Streaming to run Python, PHP, etc. For example: `hadoop jar hadoop/tools/lib/hadoop-streaming-2.7.2.jar -mapper /mapper.php -reducer /reducer.php -input /hdfsinputpath -output /hdfsoutputpath` – John Simon Jun 21 '16 at 06:29
- This might help: http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs – Neha Milak Jun 21 '16 at 07:24
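Building on the linked question: a simple way to stream an HDFS file into Python without copying it to local disk is to pipe the `hdfs dfs -cat` CLI. A minimal sketch (the helper names are mine, and it assumes the `hdfs` binary is on `PATH`):

```python
import subprocess

def stream_command_lines(cmd):
    """Yield decoded lines from a command's stdout without buffering it all."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode("utf-8").rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()

def hdfs_lines(path):
    """Stream one HDFS file line by line via the `hdfs dfs -cat` CLI."""
    return stream_command_lines(["hdfs", "dfs", "-cat", path])
```

Your existing text-file analysis loop can then iterate over `hdfs_lines(path)` exactly as it iterated over `open(...)` before.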
2 Answers
3
Hadoop Streaming API:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
Everything you need to know about this is covered here: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
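For context, a Streaming mapper or reducer in Python is just a script that reads stdin and writes tab-separated key/value pairs to stdout, as the tutorial linked above shows. A minimal word-count sketch (the function names are my own; in practice you would split this into `mapper.py` and `reducer.py` and pass them via `-mapper` and `-reducer`):

```python
import sys
from itertools import groupby

def map_line(line):
    """Mapper step: emit a (word, 1) pair for each word on the line."""
    return [(word, 1) for word in line.strip().split()]

def reduce_pairs(pairs):
    """Reducer step: pairs arrive sorted by key, so sum the counts per word."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=lambda kv: kv[0])]

if __name__ == "__main__":
    # As a mapper: Hadoop Streaming feeds input splits on stdin and
    # expects "key<TAB>value" lines on stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```

Between the two phases, Hadoop's shuffle sorts the emitted pairs by key, which is why `reduce_pairs` can assume sorted input.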

– Eduardo Barbaro
- This is what I am looking for. So basically everything will be handled by hadoop-streaming.jar, right? No extra work needed, am I correct? – M_Gandhi Jun 21 '16 at 23:35
0
In addition to the other approaches, you can also embed Pig Latin statements and Pig commands in a Python script using a JDBC-like compile, bind, run model. For Python, make sure the Jython jar is on your classpath. Refer to the Apache Pig documentation here for more details: https://pig.apache.org/docs/r0.9.1/cont.html#embed-python
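To illustrate the compile, bind, run model from the linked Pig docs: the script below runs under Jython (e.g. launched with `pig script.py`), not CPython, so treat it as an untested sketch; the relation name and the HDFS paths are placeholders.

```python
from org.apache.pig.scripting import Pig

# Compile a parameterized Pig Latin script once...
P = Pig.compile("""
    lines = LOAD '$input' AS (line:chararray);
    STORE lines INTO '$output';
""")

# ...then bind concrete HDFS paths and run it.
result = P.bind({"input": "in_dir", "output": "out_dir"}).runSingle()
if result.isSuccessful():
    print("done")
```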

– janeshs