17

Is there a way to append to a .feather format file using pd.to_feather?

I am also curious if anyone knows some of the limitations in terms of max file size, and whether it is possible to query for some specific data when you read a .feather file (such as read rows where date > '2017-03-31').

I love the idea of being able to store my dataframes and categorical data.

trench
  • 5,075
  • 12
  • 50
  • 80
  • 2
    Isn't hdf5 more suitable for this? As far as I know, feather is only designed to quickly move data from R to Python (or vice versa). It isn't meant to actually store the data. – ayhan Jun 17 '17 at 18:35
  • @trench, did you find anything about appending into a feather file? – r2evans Oct 29 '17 at 20:34
  • I did not - the latest pandas also includes Parquet read/write so I am looking into that right now actually. Most of my data is just stored in csv files and database tables currently, but I do want to explore some of these options – trench Oct 30 '17 at 18:17
  • 1
    @ayhan HDF5 has some limitations compared to feather. For example, [HDF5 does not support extension dtypes](https://github.com/pandas-dev/pandas/issues/31199). – gerrit Jan 22 '20 at 13:14

2 Answers

7

Unfortunately, both feather and parquet are column-oriented file formats, which means you're not able to "append" to them; that's only practical in row-oriented file formats. If you want to stay with parquet or feather, an alternative is to partition the files. For example, if you have data that doesn't change and is generated once per day, you can write each day's batch to its own partition, keyed by date. This creates some overhead when reading and writing, but it may be a better option than re-writing the entire file each time.

As it's a columnar format, you're also not able to query it so that you only read in rows where e.g. date>2017-01-01. What parquet excels at instead is that you're able to read in only the columns you need for your analysis.

Pureluck
  • 326
  • 2
  • 10
  • I don't think that the fact that parquet is column-oriented is the reason why you cannot append new data. In fact, parquet is a self-contained file (i.e. it contains both data and related metadata, and also performs compression: https://www.upsolver.com/blog/apache-parquet-why-use). This means that if you need to add data, it needs to recompute the compression tree and update the metadata on the whole file, not just write some blocks on the storage system. – Iqigai Oct 27 '22 at 04:38
1

For quite some time, Feather (as well as Parquet) has used a "chunked" structure, which makes it possible to write the file in chunks. While not strictly an "append", this provides most of the benefits and only requires a little additional work to structure it in code.

See https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data

dsz
  • 4,542
  • 39
  • 35