
I need to write and read a huge pandas DataFrame. I am using the pickle format right now:

  • .to_pickle to write the DataFrame to a pickle file
  • .read_pickle to read the pickle file back
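For reference, the round-trip currently looks like this (a minimal sketch; the file name and DataFrame contents are placeholders for the real ~2 GB data):

```python
import pandas as pd

# stand-in for the real DataFrame
df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

df.to_pickle("data.pkl")          # write the DataFrame to a pickle file
df = pd.read_pickle("data.pkl")   # read the pickle file back
```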

I have a couple of issues when the pickle file is huge (2 GB in this case):

  1. Read speed is very slow (23 seconds to read the data).
  2. Increasing RAM/cores in the VM does not improve the speed.

How can I read it faster? Can I use some other format that is much faster? Can I leverage parallel processing/multiple cores to read it faster?
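For context on the format question, a parquet round-trip with pandas is a one-line change per direction. This is a sketch assuming pyarrow is installed ("data.parquet" and the column names are placeholders); whether it is faster than pickle depends on the data:

```python
import pandas as pd

# stand-in for the real data
df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

# write as parquet, a columnar format (requires pyarrow or fastparquet)
df.to_parquet("data.parquet", engine="pyarrow")

# pyarrow decodes column chunks with multiple threads by default,
# and columns= lets you load only the columns you actually need
df = pd.read_parquet("data.parquet", engine="pyarrow", columns=["a"])
```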

  • You could try using Protocol 4 when you create the pickle files; it has been available since Python 3.4 but was not made the default until Python 3.8. – martineau Feb 26 '21 at 02:11
  • 2 GiB / 23 sec ≈ 89 MiB/sec, which is a reasonable maximum speed for spinning rust. – Aaron Feb 26 '21 at 03:21
  • @martineau: Thanks. Protocol 4 does optimize the operation to a certain extent; I have started using it, and it improved performance by 7-10%. I am still looking for more improvement, though: the total data size is quite large, close to 100 GB, so I am looking for faster alternatives. Is the parquet format faster than the pickle format? – krishna agrawal Feb 26 '21 at 09:42
  • Dunno about parquet, but @Aaron makes a good point. See the benchmark results in [my answer](https://stackoverflow.com/a/59013806/355230) to another question, which show the kind of speeds that can be achieved just reading plain old binary data files. Maybe the answer is to use an SSD. – martineau Feb 26 '21 at 10:34
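For what it's worth, pinning the protocol discussed in the comments is a one-argument change, since pandas' to_pickle accepts a protocol argument (a sketch; the file name and DataFrame are placeholders):

```python
import pickle
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000)})  # stand-in for the real data

# pin protocol 4 explicitly (it only became the default in Python 3.8)
df.to_pickle("data.pkl", protocol=4)

# or simply use the newest protocol this interpreter supports
df.to_pickle("data.pkl", protocol=pickle.HIGHEST_PROTOCOL)
```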

0 Answers