
I have a somewhat large (~20 GB) partitioned dataset in parquet format. I would like to read specific partitions from the dataset using pyarrow. I thought I could accomplish this with pyarrow.parquet.ParquetDataset, but that doesn't seem to be the case. Here is a small example to illustrate what I want.

To create a random dataset:

from collections import OrderedDict
from itertools import product, chain
from uuid import uuid4
import os
from glob import glob

import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow.parquet import ParquetWriter, ParquetDataset


def get_partitions(basepath, partitions):
    """Generate directory hierarchy for a paritioned dataset

    data
    ├── part1=foo
    │   └── part2=True
    ├── part1=foo
    │   └── part2=False
    ├── part1=bar
    │   └── part2=True
    └── part1=bar
        └── part2=False

    """
    path_tmpl = '/'.join(['{}={}'] * len(partitions))  # part=value
    path_tmpl = '{}/{}'.format(basepath, path_tmpl)    # part1=val/part2=val

    parts = [product([part], vals) for part, vals in partitions.items()]
    parts = [i for i in product(*parts)]
    return [path_tmpl.format(*tuple(chain.from_iterable(i))) for i in parts]


partitions = OrderedDict(part1=['foo', 'bar'], part2=[True, False])
parts = get_partitions('data', partitions)
for part in parts:
    # 3 columns, 5 rows
    data = [pa.array(np.random.rand(5)) for i in range(3)]
    table = pa.Table.from_arrays(data, ['a', 'b', 'c'])
    os.makedirs(part, exist_ok=True)
    out = ParquetWriter('{}/{}.parquet'.format(part, uuid4()),
                        table.schema, flavor='spark')
    out.write_table(table)
    out.close()
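
For reference, get_partitions expands that spec into one hive-style directory path per combination, and the loop then writes one small parquet file into each of them:

>>> get_partitions('data', partitions)
['data/part1=foo/part2=True',
 'data/part1=foo/part2=False',
 'data/part1=bar/part2=True',
 'data/part1=bar/part2=False']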

I want to read all values for partition one, and only True for partition two. With pandas.read_parquet that's not possible: I always have to read the whole dataset. I tried the following with pyarrow:

parts2 = OrderedDict(part1=['foo', 'bar'], part2=[True])
parts2 = get_partitions('data', parts2)
files = [glob('{}/*'.format(dirpath)) for dirpath in parts2]
files = [i for i in chain.from_iterable(files)]
df2 = ParquetDataset(files).read().to_pandas()

That doesn't work either; the partition columns are missing from the result:

>>> df2.columns
Index(['a', 'b', 'c'], dtype='object')

I can do this easily in pyspark like this:

def get_spark_session_ctx(appName):
    """Get or create a Spark Session, and the underlying Context."""
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(appName).getOrCreate()
    sc = spark.sparkContext
    return (spark, sc)


spark, sc = get_spark_session_ctx('test')
spark_df = spark.read.option('basePath', 'data').parquet(*parts2)
df3 = spark_df.toPandas()

As you can see below:

>>> df3.columns
Index(['a', 'b', 'c', 'part1', 'part2'], dtype='object')

Can this be done with pyarrow or pandas, or do I need some custom implementation?

Update: As requested by Wes, this is now on JIRA.

suvayu

2 Answers


As of pyarrow version 0.10.0 you can use the `filters` kwarg to do the query. In your case it would look something like this:

import pyarrow.parquet as pq
dataset = pq.ParquetDataset('path-to-your-dataset', filters=[('part2', '=', 'True'),])
table = dataset.read()

Ref
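
On newer pyarrow versions the same filter can also be passed to `pq.read_table`, which returns the partition keys as regular columns. A minimal sketch (the partition value is compared as the string 'True', since hive-style values parsed from the directory names stay strings here):

import pyarrow.parquet as pq

# Sketch: same query via read_table; 'part1' and 'part2' come back as columns.
table = pq.read_table('data', filters=[('part2', '=', 'True')])
df2 = table.to_pandas()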

ji.xu
  • I know, but unfortunately that's partial support (only for reads). Write support is postponed until 0.12.0. I believe this is covered in the JIRA. – suvayu Oct 03 '18 at 10:22
  • Thanks so much for this documentation. I tried to get `ParquetDatasetPiece` working but this was the only solution I could get working. – Geochem B Jan 09 '19 at 20:55
  • `in` operator in filters did not work for me (v 10.0.0) with `pq.ParquetDataset`, but it worked with `pq.read_table()`. In case anyone faces a similar issue. – Meet Sinojia Aug 19 '21 at 13:37

Question: How do I read specific partitions from a partitioned parquet dataset with pyarrow?

Answer: You can't right now.

Can you create an Apache Arrow JIRA requesting this feature on https://issues.apache.org/jira?

This is something that we should be able to support in the pyarrow API but it will require someone to implement it. Thank you
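
Until then, here is a minimal sketch of the "custom implementation" route from the question: read each selected partition directory separately and re-attach the partition values as plain string columns. read_partitions is a hypothetical helper, and get_partitions is the one defined in the question:

from collections import OrderedDict
from glob import glob

import pandas as pd
from pyarrow.parquet import ParquetDataset


def read_partitions(basepath, partitions):
    """Hypothetical helper: read selected partitions, re-attaching the keys."""
    frames = []
    for dirpath in get_partitions(basepath, partitions):  # helper from the question
        files = glob('{}/*.parquet'.format(dirpath))
        df = ParquetDataset(files).read().to_pandas()
        # dirpath looks like 'data/part1=foo/part2=True'; add each key=value
        # segment back as a column (values stay strings, unlike Spark).
        for segment in dirpath.split('/')[1:]:
            key, value = segment.split('=')
            df[key] = value
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


df2 = read_partitions('data', OrderedDict(part1=['foo', 'bar'], part2=[True]))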

Wes McKinney
  • I'll do that, thanks. I guess I should create the feature request for both read and write (unless I missed it). I've a lot of free time these days, so if someone guides me, I could also work on the implementation. – suvayu Dec 29 '17 at 06:03
  • @wes What is the progress on pyarrow reading a partitioned parquet table from HDFS? It seems the pyarrow HDFS API still can't provide a dataset object to the `pq.ParquetDataset()` method. – robert Aug 14 '18 at 13:22