
I have a project requirement. I'm using a Python script to analyze data. Initially I used txt files as input to that Python script, but as the data grew I had to switch my storage platform to Hadoop HDFS. How can I provide HDFS data to the Python script as input? Is there any way? Thanks in advance.

M_Gandhi
  • Use Hadoop Streaming to run Python, PHP, etc. Ex: hadoop jar hadoop/tools/lib/hadoop-streaming-2.7.2.jar -mapper /mapper.php -reducer /reducer.php -input /hdfsinputpath -output /hdfsoutputpath – John Simon Jun 21 '16 at 06:29
  • This might help : http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs – Neha Milak Jun 21 '16 at 07:24
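The approach in the linked question (streaming a file out of HDFS into a Python script without copying it locally) can be sketched roughly like this; the function name and the `cmd` parameter are illustrative, and it assumes the `hdfs` CLI is on your PATH:

```python
import subprocess

def hdfs_lines(hdfs_path, cmd=('hdfs', 'dfs', '-cat')):
    """Yield lines of an HDFS file by piping `hdfs dfs -cat` into Python.

    `cmd` is overridable (e.g. to plain `cat`) for local testing -- an
    assumption for illustration, not part of any Hadoop API.
    """
    proc = subprocess.Popen(list(cmd) + [hdfs_path],
                            stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode('utf-8').rstrip('\n')
    finally:
        proc.stdout.close()
        proc.wait()
```

Your existing analysis code can then iterate over `hdfs_lines('/path/in/hdfs')` exactly as it iterated over a local txt file.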

2 Answers


Hadoop Streaming API:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

All you need to know about that is here: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
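To make the streaming model concrete, here is a minimal word-count sketch of the mapper/reducer pair such a job would invoke (the stage-selection `argv` convention and function names are assumptions for illustration). Hadoop Streaming feeds each script lines on stdin and reads tab-separated key/value lines from its stdout, delivering mapper output to the reducer sorted by key:

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit a 'word<TAB>1' line for every word seen."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word

def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so consecutive lines
    with the same word can simply be grouped and summed."""
    pairs = (line.split('\t') for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield '%s\t%d' % (word, sum(int(count) for _, count in group))

if __name__ == '__main__' and len(sys.argv) > 1:
    # Run as `script.py map` (as -mapper) or `script.py reduce` (as -reducer).
    step = map_lines if sys.argv[1] == 'map' else reduce_lines
    for out in step(line.rstrip('\n') for line in sys.stdin):
        print(out)
```

You would then pass the script via `-mapper` / `-reducer` (plus `-file`) in place of `/bin/cat` and `/bin/wc` in the command above.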

Eduardo Barbaro
  • This is what I am looking for. So basically everything will be handled by hadoop-streaming.jar, right? No extra work needed.. am I correct? – M_Gandhi Jun 21 '16 at 23:35

In addition to the other approaches, you can also embed Pig Latin statements and Pig commands in a Python script using a JDBC-like compile, bind, run model. For Python, make sure the Jython jar is on your classpath. Refer to the Apache Pig documentation for more details: https://pig.apache.org/docs/r0.9.1/cont.html#embed-python
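The compile/bind/run model could look roughly like this in a Jython script (launched with `pig script.py`); the word-count Pig Latin and the HDFS paths are illustrative assumptions:

```python
# Jython (Python 2) script run by Pig itself; requires the Jython jar.
from org.apache.pig.scripting import Pig

# compile: parse the Pig Latin once, leaving $in/$out as parameters
P = Pig.compile("""
    lines = LOAD '$in' AS (line:chararray);
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd  = GROUP words BY word;
    cnts  = FOREACH grpd GENERATE group, COUNT(words);
    STORE cnts INTO '$out';
""")

# bind + run: substitute concrete HDFS paths and launch the job
result = P.bind({'in': '/hdfsinputpath', 'out': '/hdfsoutputpath'}).runSingle()
if not result.isSuccessful():
    raise RuntimeError('Pig job failed')
```

The advantage of this model over plain streaming is that the surrounding Python can loop, branch on job results, and re-bind the same compiled script with different parameters.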

janeshs