I have Python 3.7.3 with pyarrow 2.0.0 and I'm trying to open a 3.7GB parquet file. The Python script immediately terminates, and "Killed" is the only output I see, so I don't have much to go on and I'm not sure why it was killed. The computer attempting to open it has 16GB of RAM, so it would seem there should be enough RAM to handle it. Is there a way I can get more information as to why it was "Killed"?
- Are you using 32-bit Python? If so then the most memory it can access is 4GB, minus some that is used for various other things. – Kemp Mar 12 '21 at 15:02
- I ran this command and it returned True, indicating 64-bit: `python3 -c "import sys; print(sys.maxsize > 2**32)"` – raphael75 Mar 12 '21 at 15:18
- Parquet files are compressed. A 3.7GB parquet file could easily be more than 16GB when uncompressed. "Killed" means the Linux OOM killer killed your process, so you must be exceeding RAM. Try using `ParquetFile.read_row_group` to stream the file. If you're lucky there is more than one row group and you can read it piecemeal. – Pace Mar 12 '21 at 16:39
- It looks like the file contains about 26 row groups, and it is able to load them if I use `read_row_group`. Thank you!! – raphael75 Mar 12 '21 at 18:46
- @Pace When writing the parquet file, do you have to create row groups manually or does it make them automatically? I don't see a `create_row_group` function or anything similar. – raphael75 Mar 12 '21 at 20:37
- It depends what tool you are using to write them. For `pyarrow`, if you are using `pq.write_table` or `pq.write_to_dataset` then there is a `row_group_size` argument (see the sketch below these comments). I'm not as familiar with non-arrow parquet writers, but I assume they have something similar. Also, if for whatever reason you are stuck with a single row group, you can always read piecemeal by columns: first read the metadata to get the list of columns, then read batches of columns. – Pace Mar 12 '21 at 21:11
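A minimal sketch of writing with an explicit row-group size, assuming pyarrow is the writer; the table contents, the file name `example.parquet`, and the 100,000-row group size are placeholders for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example table (placeholder data).
table = pa.table({
    "a": list(range(1_000_000)),
    "b": [i * 0.5 for i in range(1_000_000)],
})

# row_group_size caps the number of rows per row group, so the file is
# split into several groups that can later be read back one at a time.
pq.write_table(table, "example.parquet", row_group_size=100_000)
```

With 1,000,000 rows and `row_group_size=100_000`, the resulting file contains 10 row groups.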
1 Answer
The message "Killed" comes from the Linux OOM (out-of-memory) killer. You can confirm this by inspecting the kernel log (e.g. `dmesg` or `journalctl -k`), which will contain an entry about the killed process.
A parquet file is compressed and so a 3.7GB parquet file could contain more than 16GB of data once loaded into memory.
You will need to read the file piecemeal. If the file has row groups you can read it one row group at a time. If the file does not have row groups (or you don't want to read it that way) you can pick fewer columns to load.
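A minimal sketch of both approaches, assuming the file lives at the placeholder path `big.parquet` and that `col_a` / `col_b` stand in for real column names:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")

# Option 1: read one row group at a time (works when the writer created
# more than one row group).
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i)   # a pyarrow.Table holding just this group
    # ... process `chunk`, e.g. chunk.to_pandas() ...

# Option 2: read only the columns you need. The schema is cheap to read
# and tells you which columns exist before loading any data.
print(pf.schema_arrow.names)
subset = pq.read_table("big.parquet", columns=["col_a", "col_b"])
```

`read_row_group` also accepts a `columns` argument, so the two approaches can be combined if a single row group is still too large.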

Pace