
I need to write and read a huge pandas DataFrame. I am using the pickle format right now:

  • .to_pickle to write the DataFrame to a pickle file
  • .read_pickle to read the pickle file back
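For reference, the round-trip currently looks like this (a minimal sketch; the file name and DataFrame contents are placeholders for the real ~2 GB data):

```python
import pandas as pd

# stand-in for the real DataFrame
df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

df.to_pickle("data.pkl")          # write the DataFrame to a pickle file
df = pd.read_pickle("data.pkl")   # read the pickle file back
```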

I have a couple of issues when the pickle file is huge (2 GB in this case):

  1. Read speed is very slow (23 seconds to read the data).
  2. Increasing RAM/cores in the VM does not improve the speed.

How can I read it faster? Can I use some other format that is much faster? Can I leverage parallel processing/multiple cores to read it faster?
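For context on the format question, a parquet round-trip with pandas is a one-line change per direction. This is a sketch assuming pyarrow is installed ("data.parquet" and the column names are placeholders); whether it is faster than pickle depends on the data:

```python
import pandas as pd

# stand-in for the real data
df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

# write as parquet, a columnar format (requires pyarrow or fastparquet)
df.to_parquet("data.parquet", engine="pyarrow")

# pyarrow decodes column chunks with multiple threads by default,
# and columns= lets you load only the columns you actually need
df = pd.read_parquet("data.parquet", engine="pyarrow", columns=["a"])
```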

  • You could try using Protocol 4 when you create the pickle files; it has been available since Python 3.4 but was not made the default until Python 3.8. – martineau Feb 26 '21 at 02:11
  • 2 GiB / 23 sec ≈ 89 MiB/sec, which is a reasonable maximum speed for spinning rust. – Aaron Feb 26 '21 at 03:21
  • @martineau: Thanks. Protocol 4 does optimize the operation to a certain extent; I have started using it, and it improved performance by 7-10%. I am still looking for more improvement, though: the total data size is quite large, close to 100 GB, so I am looking for faster alternatives. Is the parquet format faster than the pickle format? – krishna agrawal Feb 26 '21 at 09:42
  • Dunno about parquet, but @Aaron makes a good point. See the benchmark results in [my answer](https://stackoverflow.com/a/59013806/355230) to another question, which show the kind of speeds that can be achieved just reading plain old binary data files. Maybe the answer is to use an SSD. – martineau Feb 26 '21 at 10:34
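For what it's worth, pinning the protocol discussed in the comments is a one-argument change, since pandas' to_pickle accepts a protocol argument (a sketch; the file name and DataFrame are placeholders):

```python
import pickle
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000)})  # stand-in for the real data

# pin protocol 4 explicitly (it only became the default in Python 3.8)
df.to_pickle("data.pkl", protocol=4)

# or simply use the newest protocol this interpreter supports
df.to_pickle("data.pkl", protocol=pickle.HIGHEST_PROTOCOL)
```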

0 Answers