
Background: I'm just getting started with scikit-learn, and read at the bottom of the page about joblib versus pickle:

"it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string"

I read this Q&A on pickle, Common use-cases for pickle in Python, and wonder if the community here can share the differences between joblib and pickle. When should one be used over the other?

– msunbot

4 Answers

  • joblib is usually significantly faster on large numpy arrays because it has special handling for the array buffers of the numpy data structures. To find out about the implementation details you can have a look at the source code. It can also compress that data on the fly while pickling using zlib or lz4.
  • joblib also makes it possible to memory-map the data buffer of an uncompressed joblib-pickled numpy array when loading it, which makes it possible to share memory between processes.
  • if you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
  • since PEP 574 (pickle protocol 5) was merged in Python 3.8, it is now much more efficient (memory-wise and CPU-wise) to pickle large numpy arrays using the standard library. Large arrays in this context mean 4GB or more.
  • But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory-mapped mode with mmap_mode="r"; see the sketch after this list.
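
A minimal sketch of the features above (the array shape and file names are illustrative, not from the answer):

# Sketch: on-the-fly compression, memory-mapped loading, and pickle protocol 5
import pickle
import numpy as np
import joblib

big = np.random.rand(1000, 1000)  # stand-in for a large numpy array

# compress the data on the fly while pickling (zlib, level 3 here)
joblib.dump(big, "big.joblib", compress=3)

# memory-map an *uncompressed* joblib pickle on load, so the array
# buffer can be shared between processes instead of copied
joblib.dump(big, "big_raw.joblib")
mapped = joblib.load("big_raw.joblib", mmap_mode="r")

# since Python 3.8 (PEP 574), the standard library handles large
# array buffers efficiently with pickle protocol 5
with open("big.pkl", "wb") as f:
    pickle.dump(big, f, protocol=5)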
– ogrisel
  • Does it mean that we should use `Joblib` over `Pickle`? Any downsides of `Joblib` that we should consider? I've just heard about `Joblib` recently and it sounds interesting to me. – Chau Pham Jan 31 '19 at 05:40
  • I have updated my answer with downsides and new stuff happening in the standard library. – ogrisel Jan 31 '19 at 14:03
  • Does joblib also execute arbitrary code during unpickling? (Unsafe) – Mr-Programs Mar 24 '20 at 06:37
  • It's hard to read through all the "Note that..." and get the one-line summary: *joblib is X times faster to write large numpy arrays in 3.8, roughly what is X? and to read? and pickle is roughly Y times faster to write lots of small Python objects, what is Y? and to read?* Also, what are the relative compression ratios/filesizes? – smci Jun 14 '20 at 23:48
  • By default neither joblib nor pickle compress the data, so the size of the file is approximately the same as the array in memory. But you can dump into a compressing file object in both cases (e.g. https://docs.python.org/3/library/gzip.html#gzip.GzipFile). joblib also has a high-level way to do it: https://joblib.readthedocs.io/en/latest/persistence.html#compressed-joblib-pickles (see the sketch after these comments). The compression ratio depends on the data in the array (random => low compression, regular / constant => high). – ogrisel Jun 21 '20 at 16:29
  • I wonder if this answer is still valid 10 years later. `scikit-learn` still suggests using `joblib`. There must be a reason, right? – Dr_Zaszuś Jun 02 '22 at 14:38
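
A small sketch of the two compression routes mentioned in ogrisel's comment (the data and file names are placeholders):

# Sketch: compressed pickling with the standard library vs. joblib
import gzip
import pickle
import joblib

data = {"weights": list(range(100000))}  # hypothetical payload

# dump into a compressing file object with the standard library
with gzip.GzipFile("data.pkl.gz", "wb") as f:
    pickle.dump(data, f)

# joblib's high-level equivalent: compression is inferred from the extension
joblib.dump(data, "data.joblib.gz")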

Thanks to Gunjan for giving us this script! I modified it for Python 3 results:

# compare pickle loaders
from time import time
import pickle
import os
import _pickle as cPickle
import joblib  # sklearn.externals.joblib is deprecated; use the standalone package

file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')

t1 = time()
lis = []  # note: this list is never used
with open(file, "rb") as f:
    d = pickle.load(f)
print("time for loading file size with pickle", os.path.getsize(file), "bytes =>", time() - t1)

# note: in Python 3, pickle already uses the C implementation, and the
# first load above has warmed the OS file cache, which flatters the timings below
t1 = time()
with open(file, "rb") as f:
    cPickle.load(f)
print("time for loading file size with cpickle", os.path.getsize(file), "bytes =>", time() - t1)

t1 = time()
joblib.load(file)
print("time for loading file size joblib", os.path.getsize(file), "bytes =>", time() - t1)

time for loading file size with pickle 79708 bytes => 0.16768312454223633
time for loading file size with cpickle 79708 bytes => 0.0002372264862060547
time for loading file size joblib 79708 bytes => 0.0006849765777587891
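
To reproduce, a hypothetical way to construct a `database.clf` of comparable size (the original object is not described in the answer):

# Sketch: create a ~80 KB test file for the benchmark above
import os
import pickle
import numpy as np

file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')
with open(file, "wb") as f:
    pickle.dump(np.random.rand(100, 100), f)  # 100x100 float64 ≈ 80 KB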
– Michael Mano
  • Gunjan used a 1154320653 byte pickle file. Could a bigger file make a difference in favor of joblib? – guiferviz Jun 07 '19 at 17:38
  • Please please please always state your Python version when showing performance numbers. 2.6? 2.7? 3.6? 3.7? Better still, report relative numbers joblib vs pickle vs cPickle. Also, fix Gunjan's mistake of 1.1 GB not 1.1 TB – smci Jun 14 '20 at 23:58
  • Just some questions: (1) Is the line `lis = []` needed? (2) How can the code be reproduced? That is, how should we construct the `database` file? Thank you. – RMurphy Aug 25 '21 at 16:21

I came across the same question, so I tried this one (with Python 2.7), as I needed to load a large pickle file:

# compare pickle loaders (Python 2.7)
from time import time
import pickle
import os
try:
    import cPickle
except ImportError:
    print "Cannot import cPickle"
import joblib

t1 = time()
lis = []  # note: this list is never used
d = pickle.load(open("classi.pickle", "rb"))
print "time for loading file size with pickle", os.path.getsize("classi.pickle"), "bytes =>", time() - t1

t1 = time()
cPickle.load(open("classi.pickle", "rb"))
print "time for loading file size with cpickle", os.path.getsize("classi.pickle"), "bytes =>", time() - t1

t1 = time()
joblib.load("classi.pickle")
print "time for loading file size joblib", os.path.getsize("classi.pickle"), "bytes =>", time() - t1

Output for this is:

time for loading file size with pickle 1154320653 bytes => 6.75876188278
time for loading file size with cpickle 1154320653 bytes => 52.6876490116
time for loading file size joblib 1154320653 bytes => 6.27503800392

According to this, joblib works better than the cPickle and pickle modules out of these three. Thanks!

– Gunjan
  • I thought cPickle should be faster than pickle? – Echo May 01 '16 at 05:00
  • Is this benchmark done with python 3, which uses pickle(protocol=3) by default (which is faster than the default in python2)? – LearnOPhile Sep 15 '17 at 06:42
  • os.path.getsize returns bytes not kilobytes, so we're talking about a file of approximately 1.1 GB (and not 1.1 TB, as the original output suggested) – Vlad Iliescu Apr 03 '19 at 07:13
  • This is great, but please fix up the output to reflect it's 1.1 GB not 1.1 TB. Better still would be plotting comparative numbers for filesizes in powers-of-10 from 1KB...10GB, for Python versions 3.6, 3.7, 3.8 and 2.7, for joblib, pickle and cPickle. – smci Jun 14 '20 at 23:56

Just a humble note ... pickle is better for fitted scikit-learn estimators / trained models. In ML applications, trained models are mainly saved and loaded back up for prediction.
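
A minimal sketch of that workflow (the estimator and file name are illustrative):

# Sketch: persist a fitted scikit-learn model with pickle
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# save the trained model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# load it back later for prediction
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict(X[:5]))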