With the code you posted, you are doing the following conversions (sketched below):
1. Load the data from disk into RAM; Feather files are already in the Arrow format.
2. Convert the data from Arrow into a pandas DataFrame.
3. Convert the pandas DataFrame back into Arrow.
4. Serialize the data from Arrow into Parquet.
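For reference, the posted code presumably does something like the pandas roundtrip below. This is an assumption about your code; the file names are placeholders.

import pandas as pd

# Steps 1 + 2: read the Feather file and convert the Arrow data into pandas
df = pd.read_feather("file.feather")
# Steps 3 + 4: convert pandas back into Arrow, then serialize to Parquet
df.to_parquet("path/file.parquet.gzip", compression="gzip")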
Steps 2-4 are each expensive. You will not be able to avoid step 4, but by keeping the data in Arrow instead of taking the detour through pandas, you can skip steps 2 and 3 with the following snippet:
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Read the Feather file directly into an Arrow Table, without a pandas detour
table = feather.read_table("file.feather")
# Write the Arrow Table straight to Parquet
pq.write_table(table, "path/file.parquet")
A minor issue, but you should avoid using the .gzip ending with Parquet files. A .gzip / .gz ending indicates that the whole file is compressed with gzip and can be unpacked with gunzip. That is not the case with gzip-compressed Parquet files: the Parquet format compresses individual segments of the data and leaves the metadata uncompressed, which gives nearly the same compression ratio at a much higher compression speed. The compression codec is thus an implementation detail inside the file and not something other tools can act on, so a plain .parquet extension is the better choice.