
I'm trying to read a list of Parquet files from an S3 bucket into a single pyarrow table.

If I specify the filename, I can do:

from pyarrow.parquet import ParquetDataset
import s3fs
dataset = ParquetDataset(
    "s3://path/to/file/myfile.snappy.parquet",
    filesystem=s3fs.S3FileSystem(),
)

And everything works as expected. However, if I do:

dataset = ParquetDataset(
    "s3://path/to/file",
    filesystem=s3fs.S3FileSystem(),
)

I get:

pyarrow/_parquet.pyx:1036: in pyarrow._parquet.ParquetReader.open
pyarrow.lib.ArrowIOError: Invalid Parquet file size is 0 bytes

2 Answers


This happened to me because of empty "success" marker files (such as the _SUCCESS files Spark writes) sitting at the same S3 prefix as my Parquet files. I resolved it by first listing the files under the prefix and keeping only those whose names end in ".parquet":

from pyarrow.parquet import ParquetDataset
import s3fs

s3 = s3fs.S3FileSystem()

# List everything under the prefix and keep only the actual Parquet files,
# skipping the zero-byte marker files that trip up ParquetDataset.
paths = [path for path in s3.ls("s3://path/to/file/") if path.endswith(".parquet")]

dataset = ParquetDataset(paths, filesystem=s3)
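
This only constructs the dataset object; to get the pyarrow table the question asks about, you still need to read it. A minimal continuation of the snippet above:

table = dataset.read()    # materialize the filtered files as a single pyarrow.Table
df = table.to_pandas()    # optional: hand off to pandas from here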

I think the answer has something to do with this, from the Apache Arrow docs:

The ParquetDataset class accepts either a directory name or a list of file paths, and can discover and infer some common partition structures, such as those produced by Hive:

import pyarrow.parquet as pq

dataset = pq.ParquetDataset('dataset_name/')
table = dataset.read()

So I think the automatic discovery of filenames only works when the files you're trying to read are laid out in a partition structure that ParquetDataset recognizes, e.g. Hive-style partitioning.
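
As a rough sketch of what that means (the bucket layout below is hypothetical), a Hive-partitioned dataset encodes the partition keys in the directory names, and ParquetDataset can discover them and turn them back into columns:

import pyarrow.parquet as pq
import s3fs

# Hypothetical Hive-style layout:
#   s3://path/to/dataset/year=2021/month=1/part-0.parquet
#   s3://path/to/dataset/year=2021/month=2/part-0.parquet
dataset = pq.ParquetDataset(
    "s3://path/to/dataset",
    filesystem=s3fs.S3FileSystem(),
)
table = dataset.read()  # 'year' and 'month' appear as columns in the table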
