
I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas DataFrame in order to apply standard non-distributed algorithms.

At some level a workable solution is trivial: a `hadoop dfs -copyToLocal` followed by local file system operations. However, I am looking for a particularly elegant way to load the data that I will incorporate into my standard practice.
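For concreteness, a rough sketch of what I mean by the trivial approach (the HDFS path, local staging directory, and Ctrl-A delimiter are just placeholders for whatever the Hive queries produce):

```python
import glob
import os
import shutil
import subprocess
import tempfile

import pandas as pd

# Pull the Hive output directory down to a temporary local directory.
local_dir = tempfile.mkdtemp()
subprocess.check_call(
    ["hadoop", "dfs", "-copyToLocal", "/user/hive/warehouse/my_table", local_dir])

# Hive writes one or more part files; read and concatenate them all.
part_files = glob.glob(os.path.join(local_dir, "my_table", "*"))
df = pd.concat(
    (pd.read_csv(f, sep="\x01", header=None) for f in part_files),
    ignore_index=True)

shutil.rmtree(local_dir)  # the local copy still has to be cleaned up
```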

Some characteristics of an ideal solution:

  1. No need to create a local copy (who likes clean up?)
  2. Minimal number of system calls
  3. Few lines of Python code
  • You might like to see [this question](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – Andy Hayden May 16 '13 at 21:50
  • are you aiming to assember the query results in a distributed way? or run a single process to produce a combined frame? roughly how much data? (total shape) – Jeff May 16 '13 at 22:33
  • you can use `hadoop dfs -get /path/to/file -` to stream the contents to stdout - not elegant but does meet your first ideal requirement (not ideal if the stream errors though..) – Chris White May 17 '13 at 01:52
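Building on Chris White's suggestion, a minimal sketch that streams one HDFS file to stdout and hands the pipe to pandas (the path and delimiter are placeholders, and as noted an error mid-stream only surfaces at the end):

```python
import subprocess

import pandas as pd

# Stream a single part file to stdout and parse it without any local copy.
proc = subprocess.Popen(
    ["hadoop", "dfs", "-get", "/user/hive/warehouse/my_table/000000_0", "-"],
    stdout=subprocess.PIPE)
df = pd.read_csv(proc.stdout, sep="\x01", header=None)
proc.wait()  # a failure in the stream only becomes visible here
```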

1 Answer


It looks like the pydoop.hdfs module solves this problem while meeting a good number of the goals above:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html

I was not able to evaluate it myself, as pydoop has very strict build requirements and my Hadoop version is a bit dated.
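Untested, but from the documentation the usage would look roughly like this (the directory path, delimiter, and helper name are mine):

```python
import pandas as pd
import pydoop.hdfs as hdfs

def read_hdfs_dir(path, **kwargs):
    """Concatenate every delimited part file under an HDFS directory."""
    frames = []
    for name in hdfs.ls(path):      # the part files written by Hive
        f = hdfs.open(name)         # file-like handle, no local copy
        try:
            frames.append(pd.read_csv(f, **kwargs))
        finally:
            f.close()
    return pd.concat(frames, ignore_index=True)

df = read_hdfs_dir("/user/hive/warehouse/my_table", sep="\x01", header=None)
```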
