Read a Parquet file from HDFS using PyArrow

Question

I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect().

I also know I can read a Parquet file using pyarrow.parquet's read_table().

However, read_table() accepts a filepath, whereas hdfs.connect() gives me a HadoopFileSystem instance.

Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a Parquet file or folder residing in an HDFS cluster? What I ultimately want is to call to_pydict() on the result, so that I can pass the data along.

score 7 · Accepted Answer · answered Nov 22 '17 at 21:07

Try

import pyarrow as pa
fs = pa.hdfs.connect(...)
table = fs.read_parquet('/path/to/hdfs-file', **other_options)

or

import pyarrow.parquet as pq
with fs.open(path) as f:
    table = pq.read_table(f, **read_options)

I opened https://issues.apache.org/jira/browse/ARROW-1848 about adding more explicit documentation for this.

Wes McKinney


Thanks! Just to make sure: do they both use the same mechanism? Does read_table() accept "file handles" in general? – Jay Nov 23 '17 at 04:10

score 1 · Answer 2 · answered Dec 04 '19 at 09:44

I tried the same thing with the Pydoop library and engine='pyarrow', and it worked perfectly for me. Here is the generalized method.

!pip install pydoop pyarrow
import logging

import pandas as pd
import pydoop.hdfs as hd

logger = logging.getLogger(__name__)

# Read a Parquet file via Pydoop and return it as a DataFrame
def readParquetFilesPydoop(path):
    with hd.open(path) as f:
        df = pd.read_parquet(f, engine='pyarrow')
        logger.info('file: ' + path + ' : ' + str(df.shape))
        return df

score 0 · Answer 3 · answered May 23 '23 at 21:42

You can read and write with pyarrow as shown in the accepted answer. However, the APIs used there have long been deprecated and don't work with recent versions of Hadoop. Use the pyarrow.fs API instead:

from pyarrow import fs
import pyarrow.parquet as pq

# connect to Hadoop (replace 'hostname' and 8020 with your NameNode host and port)
hdfs = fs.HadoopFileSystem('hostname', 8020)

# read a single file from HDFS
with hdfs.open_input_file(path) as pqt:
    df = pq.read_table(pqt).to_pandas()

# read a directory of partitioned Parquet files (e.g. written by Spark)
df = pq.ParquetDataset(path, filesystem=hdfs).read().to_pandas()
