
I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas DataFrame in order to apply standard non-distributed algorithms.

At some level a workable solution is trivial: a `hadoop dfs -copyToLocal` followed by local file system operations. However, I am looking for a particularly elegant way to load the data that I will incorporate into my standard practice.
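For concreteness, a rough sketch of what I mean by the trivial approach (the HDFS path, local staging directory, and Ctrl-A delimiter are just placeholders for whatever the Hive queries produce):

```python
import glob
import os
import shutil
import subprocess
import tempfile

import pandas as pd

# Pull the Hive output directory down to a temporary local directory.
local_dir = tempfile.mkdtemp()
subprocess.check_call(
    ["hadoop", "dfs", "-copyToLocal", "/user/hive/warehouse/my_table", local_dir])

# Hive writes one or more part files; read and concatenate them all.
part_files = glob.glob(os.path.join(local_dir, "my_table", "*"))
df = pd.concat(
    (pd.read_csv(f, sep="\x01", header=None) for f in part_files),
    ignore_index=True)

shutil.rmtree(local_dir)  # the local copy still has to be cleaned up
```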

Some characteristics of an ideal solution:

  1. No need to create a local copy (who likes clean up?)
  2. Minimal number of system calls
  3. Few lines of Python code
  • You might like to see [this question](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – Andy Hayden May 16 '13 at 21:50
  • are you aiming to assember the query results in a distributed way? or run a single process to produce a combined frame? roughly how much data? (total shape) – Jeff May 16 '13 at 22:33
  • you can use `hadoop dfs -get /path/to/file -` to stream the contents to stdout - not elegant but does meet your first ideal requirement (not ideal if the stream errors though..) – Chris White May 17 '13 at 01:52
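Building on Chris White's suggestion, a minimal sketch that streams one HDFS file to stdout and hands the pipe to pandas (the path and delimiter are placeholders, and as noted an error mid-stream only surfaces at the end):

```python
import subprocess

import pandas as pd

# Stream a single part file to stdout and parse it without any local copy.
proc = subprocess.Popen(
    ["hadoop", "dfs", "-get", "/user/hive/warehouse/my_table/000000_0", "-"],
    stdout=subprocess.PIPE)
df = pd.read_csv(proc.stdout, sep="\x01", header=None)
proc.wait()  # a failure in the stream only becomes visible here
```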

1 Answer


It looks like the pydoop.hdfs module solves this problem while meeting a good number of the goals above:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html

I was not able to evaluate it myself, as pydoop has very strict build requirements and my Hadoop version is a bit dated.
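Untested, but from the documentation the usage would look roughly like this (the directory path, delimiter, and helper name are mine):

```python
import pandas as pd
import pydoop.hdfs as hdfs

def read_hdfs_dir(path, **kwargs):
    """Concatenate every delimited part file under an HDFS directory."""
    frames = []
    for name in hdfs.ls(path):      # the part files written by Hive
        f = hdfs.open(name)         # file-like handle, no local copy
        try:
            frames.append(pd.read_csv(f, **kwargs))
        finally:
            f.close()
    return pd.concat(frames, ignore_index=True)

df = read_hdfs_dir("/user/hive/warehouse/my_table", sep="\x01", header=None)
```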
