
TL;DR: Of the various compression algorithms available in Python (gzip, bz2, lzma, etc.), which has the best decompression performance?

Full discussion:

Python 3 has various modules for compressing/decompressing data, including gzip, bz2 and lzma. gzip and bz2 additionally let you set different compression levels.

If my goal is to balance file size (/compression ratio) and decompression speed (compression speed is not a concern), which is going to be the best choice? Decompression speed is more important than file size, but as the uncompressed files in question would be around 600-800MB each (32-bit RGB .png image files), and I have a dozen of them, I do want some compression.

  • My use case is that I am loading a dozen images from disk, doing some processing on them (as a numpy array) and then using the processed array data in my program.

    • The images never change, I just have to load them each time I run my program.
    • The processing takes about the same length of time as the loading (several seconds), so I'm trying to save some loading time by saving the processed data (using pickle) rather than loading the raw, unprocessed, images every time. Initial tests were promising - loading the raw/uncompressed pickled data took less than a second, vs 3 or 4 seconds to load and process the original image - but as mentioned it resulted in file sizes of around 600-800MB, while the original png images were only around 5MB. So I'm hoping I can strike a balance between loading time and file size by storing the pickled data in a compressed format.
  • UPDATE: The situation is actually a bit more complicated than I represented above. My application uses PySide2, so I have access to the Qt libraries.

    • If I read the images and convert to a numpy array using pillow (PIL.Image), I actually don't have to do any processing, but the total time to read the image into the array is around 4 seconds.
    • If instead I use QImage to read the image, I then have to do some processing on the result to make it usable for the rest of my program, due to the endian-ness of how QImage loads the data - basically I have to swap the byte order and then rotate each "pixel" so that the alpha channel (which is apparently added by QImage) comes last rather than first. This whole process takes about 3.8 seconds, so marginally faster than just using PIL.
    • If I save the numpy array uncompressed, then I can load it back in 0.8 seconds - by far the fastest, but with a large file size.
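The QImage post-processing described above (swap the byte order, then rotate each pixel so alpha comes last) can be sketched in pure numpy; the BGRA source layout used here is an assumption for illustration, since the exact layout depends on the QImage format:

```python
import numpy as np

# One 1x2 "image" in the BGRA byte layout QImage uses for 32-bit formats
# on little-endian machines (an assumption for illustration)
bgra = np.array([[[10, 20, 30, 255],
                  [40, 50, 60, 255]]], dtype=np.uint8)

# Swap the byte order within each pixel: BGRA -> ARGB
argb = bgra[..., ::-1]

# Rotate each pixel so the alpha channel comes last: ARGB -> RGBA
rgba = np.roll(argb, -1, axis=2)
```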
┌────────────┬────────────────────────┬───────────────┬─────────────┐
│ Python Ver │     Library/Method     │ Read/unpack + │ Compression │
│            │                        │ Decompress (s)│    Ratio    │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.7.2      │ pillow (PIL.Image)     │ 4.0           │ ~0.006      │
│ 3.7.2      │ Qt (QImage)            │ 3.8           │ ~0.006      │
│ 3.7.2      │ numpy (uncompressed)   │ 0.8           │ 1.0         │
│ 3.7.2      │ gzip (compresslevel=9) │ ?             │ ?           │
│ 3.7.2      │ gzip (compresslevel=?) │ ?             │ ?           │
│ 3.7.2      │ bz2 (compresslevel=9)  │ ?             │ ?           │
│ 3.7.2      │ bz2 (compresslevel=?)  │ ?             │ ?           │
│ 3.7.2      │ lzma                   │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.7.3      │ ?                      │ ?             │ ?           │  
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.8beta1   │ ?                      │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.8.0final │ ?                      │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.5.7      │ ?                      │ ?             │ ?           │
├────────────┼────────────────────────┼───────────────┼─────────────┤
│ 3.6.10     │ ?                      │ ?             │ ?           │
└────────────┴────────────────────────┴───────────────┴─────────────┘
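A rough harness along these lines could fill in the "?" entries in the table; the zero array is a placeholder standing in for the real pickled image data:

```python
import bz2
import gzip
import lzma
import pickle
import time

import numpy as np

# Placeholder array standing in for the real processed image data
arr = np.zeros((512, 512, 4), dtype=np.uint8)
raw = pickle.dumps(arr)

codecs = {
    'gzip (level 9)': (lambda d: gzip.compress(d, 9), gzip.decompress),
    'bz2 (level 9)': (lambda d: bz2.compress(d, 9), bz2.decompress),
    'lzma': (lzma.compress, lzma.decompress),
}

for name, (comp, decomp) in codecs.items():
    blob = comp(raw)
    start = time.time()
    restored = pickle.loads(decomp(blob))
    elapsed = time.time() - start
    print(f"{name}: ratio={len(blob) / len(raw):.4f}, decompress+unpickle={elapsed:.3f}s")
```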

Sample .png image: As an example, take this 5.0 MB PNG image, a fairly high-resolution image of the coastline of Alaska.

Code for the png/PIL case (load into a numpy array):

from PIL import Image
import time
import numpy

start = time.time()
FILE = '/path/to/file/AlaskaCoast.png'
Image.MAX_IMAGE_PIXELS = None
img = Image.open(FILE)
arr = numpy.array(img)
print("Loaded in", time.time()-start)

This load takes around 4.2s on my machine with Python 3.7.2.

Alternatively, I can instead load the uncompressed pickle file generated by pickling the array created above.

Code for the uncompressed pickle load case:

import pickle
import time

start = time.time()
with open('/tmp/test_file.pickle', 'rb') as picklefile:
    arr = pickle.load(picklefile)
print("Loaded in", time.time()-start)

Loading from this uncompressed pickle file takes ~0.8s on my machine.
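For completeness, a sketch of how the uncompressed pickle file above would be written (the zero array is a stand-in for the real processed image data):

```python
import pickle

import numpy as np

# Stand-in for the real processed image array
arr = np.zeros((100, 100, 3), dtype=np.uint8)

# protocol 4 supports very large objects, which matters for 600-800MB arrays
with open('/tmp/test_file.pickle', 'wb') as picklefile:
    pickle.dump(arr, picklefile, protocol=4)
```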

ibrewster
  • Your sample image doesn't have an alpha channel - it is RGB rather than the RGBA you suggest. – Mark Setchell Jun 22 '19 at 09:41
  • @MarkSetchell True, but when I load it using QImage, QImage adds an alpha channel. Apparently. That probably just confuses the issue though - PIL doesn't do that. – ibrewster Jun 22 '19 at 12:33

4 Answers


The low-hanging fruit

numpy.savez_compressed('AlaskaCoast.npz', arr)
arr = numpy.load('AlaskaCoast.npz')['arr_0']

Loading is 2.3x faster than your PIL-based code.

It uses zipfile.ZIP_DEFLATED; see the savez_compressed documentation.

Your PIL code also makes an unneeded copy: array(img) should be asarray(img). The copy only costs 5% of the slow loading time, but after the other optimizations it becomes significant, and you have to keep in mind which numpy operations create a copy.

Fast decompression

According to the zstd benchmarks, when optimizing for decompression lz4 is a good choice. Just plugging this into pickle gives another 2.4x gain and is only 30% slower than uncompressed pickling.

import pickle
import lz4.frame

# with lz4.frame.open('AlaskaCoast.lz4', 'wb') as f:
#     pickle.dump(arr, f)

with lz4.frame.open('AlaskaCoast.lz4', 'rb') as f:
    arr = pickle.load(f)

Benchmarks

method                 size   load time
------                 ----   ---------
original (PNG+PIL)     5.1M   7.1
np.load (compressed)   6.7M   3.1
pickle + lz4           7.1M   1.3
pickle (uncompressed)  601M   1.0 (baseline)

The load time was measured inside Python (3.7.3), using the minimum wall-clock time over 20 runs on my desktop. According to occasional glances at top it always seemed to be running on a single core.

For the curious: profiling

I'm not sure if the Python version matters; most of the work is supposed to happen inside C libraries. To validate this I've profiled the pickle + lz4 variant:

perf record ./test.py && perf report -s dso
Overhead  Shared Object
  60.16%  [kernel.kallsyms]  # mostly page_fault and alloc_pages_vma
  27.53%  libc-2.28.so       # mainly memmove
   9.75%  liblz4.so.1.8.3    # only LZ4_decompress_*
   2.33%  python3.7
   ...

Most time is spent inside of the Linux kernel, doing page_fault and stuff associated with (re-)allocating memory, probably including disk I/O. The high amount of memmove looks suspicious. Probably Python is re-allocating (resizing) the final array every time a new decompressed chunk arrives. If anyone likes to have a closer look: python and perf profiles.

maxy
  • Neat. Is this for 3.7.3 or 3.7.2? (please edit your answer to state). You might like to copy the table format I put in the OP's question (where there's a column for version). We know that performance numbers will change with versions, so it would be nice to keep this question 'live' going forward. – smci Jun 23 '19 at 18:49
  • I've gone a bit overboard and added the *profiling* section. – maxy Jun 23 '19 at 22:11
  • I haven't tested the lz4 lib yet, but Python-Blosc with lz4 should be quite similar. I guess you can get a lot better performance by handling the metadata (array shape, dtype) separately from the actual data which is inside the array object. Pickling something is usually quite slow (a lot of memmove or memcpy, as you observed) – max9111 Jun 25 '19 at 20:07
  • The lz4 lib was quite good, and I really like this answer. However, as it turns out, blosc was exponentially faster, at least on my machine, giving load times in the sub-100 ms range, so it gets the nod. – ibrewster Jun 26 '19 at 22:23

You can use Python-blosc

It is very fast and, for small arrays (<2GB), also quite easy to use. On easily compressible data like your example, it is often faster to compress the data for IO operations. (SATA SSD: about 500 MB/s, PCIe SSD: up to 3500 MB/s.) In the decompression step the array allocation is the most costly part. If your images are of similar shape you can avoid repeated memory allocation.

Example

A contiguous array is assumed in the following example.

import blosc
import numpy as np
import pickle

def compress(arr,Path):
    #c = blosc.compress_ptr(arr.__array_interface__['data'][0], arr.size, arr.dtype.itemsize, clevel=3,cname='lz4',shuffle=blosc.SHUFFLE)
    c = blosc.compress_ptr(arr.__array_interface__['data'][0], arr.size, arr.dtype.itemsize, clevel=3,cname='zstd',shuffle=blosc.SHUFFLE)
    f=open(Path,"wb")
    pickle.dump((arr.shape, arr.dtype),f)
    f.write(c)
    f.close()
    return c,arr.shape, arr.dtype

def decompress(Path):
    f=open(Path,"rb")
    shape,dtype=pickle.load(f)
    c=f.read()
    #array allocation takes most of the time
    arr=np.empty(shape,dtype)
    blosc.decompress_ptr(c, arr.__array_interface__['data'][0])
    return arr

#Pass a preallocated array if you have many similar images
def decompress_pre(Path,arr):
    f=open(Path,"rb")
    shape,dtype=pickle.load(f)
    c=f.read()
    #no allocation needed here - decompress directly into the preallocated array
    blosc.decompress_ptr(c, arr.__array_interface__['data'][0])
    return arr

Benchmarks

#blosc.SHUFFLE, cname='zstd' -> 4728KB,  
%timeit compress(arr,"Test.dat")
1.03 s ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#611 MB/s
%timeit decompress("Test.dat")
146 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#4310 MB/s
%timeit decompress_pre("Test.dat",arr)
50.9 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#12362 MB/s

#blosc.SHUFFLE, cname='lz4' -> 9118KB, 
%timeit compress(arr,"Test.dat")
32.1 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#19602 MB/s
%timeit decompress("Test.dat")
146 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#4310 MB/s
%timeit decompress_pre("Test.dat",arr)
53.6 ms ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#11740 MB/s

Edit

This version is more general: it handles f-contiguous, c-contiguous and non-contiguous arrays, as well as arrays larger than 2GB. Also have a look at bloscpack.

import blosc
import numpy as np
import pickle

def compress(file, arr,clevel=3,cname='lz4',shuffle=1):
    """
    file           path to file
    arr            numpy nd-array
    clevel         0..9
    cname          blosclz,lz4,lz4hc,snappy,zlib,zstd
    shuffle        0-> no shuffle, 1->shuffle,2->bitshuffle
    """
    max_blk_size=100_000_000 #100 MB 

    shape=arr.shape
    #dtype object is not implemented
    if arr.dtype==object:
        raise TypeError("dtype object is not implemented")

    #Handling of fortran ordered arrays (avoid copy)
    is_f_contiguous=False
    if arr.flags['F_CONTIGUOUS']==True:
        is_f_contiguous=True
        arr=arr.T.reshape(-1)
    else:
        arr=np.ascontiguousarray(arr.reshape(-1))

    #Writing
    max_num=max_blk_size//arr.dtype.itemsize
    num_chunks=arr.size//max_num

    if arr.size%max_num!=0:
        num_chunks+=1

    f=open(file,"wb")
    pickle.dump((shape,arr.size,arr.dtype,is_f_contiguous,num_chunks,max_num),f)
    size=np.empty(1,np.uint32)
    num_write=max_num
    for i in range(num_chunks):
        if max_num*(i+1)>arr.size:
            num_write=arr.size-max_num*i
        c = blosc.compress_ptr(arr[max_num*i:].__array_interface__['data'][0], num_write, 
                               arr.dtype.itemsize, clevel=clevel,cname=cname,shuffle=shuffle)
        size[0]=len(c)
        size.tofile(f)
        f.write(c)
    f.close()

def decompress(file,prealloc_arr=None):
    f=open(file,"rb")
    shape,arr_size,dtype,is_f_contiguous,num_chunks,max_num=pickle.load(f)

    if prealloc_arr is None:
        arr=np.empty(arr_size,dtype)
    else:
        if prealloc_arr.flags['F_CONTIGUOUS']:
            prealloc_arr=prealloc_arr.T
        if not prealloc_arr.flags['C_CONTIGUOUS']:
            raise TypeError("Contiguous array is needed")
        arr=np.frombuffer(prealloc_arr.data, dtype=dtype, count=arr_size)

    for i in range(num_chunks):
        size=np.fromfile(f,np.uint32,count=1)
        c=f.read(size[0])
        blosc.decompress_ptr(c, arr[max_num*i:].__array_interface__['data'][0])
    f.close()

    #reshape
    if is_f_contiguous:
        arr=arr.reshape(shape[::-1]).T
    else:
        arr=arr.reshape(shape)
    return arr
max9111
  • Not only is this method by far the fastest on my machine (even with allocating the array each time), but it also produces the smallest files. Win-Win! – ibrewster Jun 26 '19 at 22:40
  • This worked better than lz4 for me, doubling the 'effective' read speed on NVMe. Though that allocation takes the most time isn't necessarily the case, especially for lower-precision dtypes. – OverLordGoldDragon Mar 01 '21 at 14:58
  • For use as a function in the wild, the compression should start with `arr = np.ascontiguousarray(arr)`, in case arr is a view of an array, and not contiguous in memory. I tried to edit the answer, but the edit-queue is full. – Atnas Mar 11 '21 at 16:29
  • @Atnas I will edit the answer. I am a bit short on time now. There is also other useful stuff to implement, e.g. handling of arrays which are larger than 2GB, and handling of fortran-ordered arrays, where an ascontiguousarray does not make sense; instead a flag that it is fortran-contiguous will be necessary. – max9111 Mar 11 '21 at 16:37
  • @Atnas I added a more advanced version (arrays >2GB, fortran-/c-contiguous arrays, and non-contiguous arrays) – max9111 Mar 14 '21 at 21:43

You can continue to use your existing PNGs and enjoy the space saving, but gain some speed by using libvips. Here is a comparison, but rather than test the speed of my laptop versus yours, I have shown 3 different methods so you can see the relative speed. I used:

  • PIL
  • OpenCV
  • pyvips

#!/usr/bin/env python3

import numpy as np
import pyvips
import cv2
from PIL import Image

def usingPIL(f):
    im = Image.open(f)
    return np.asarray(im)

def usingOpenCV(f):
    arr = cv2.imread(f,cv2.IMREAD_UNCHANGED)
    return arr

def usingVIPS(f):
    image = pyvips.Image.new_from_file(f)
    mem_img = image.write_to_memory()
    imgnp = np.frombuffer(mem_img, dtype=np.uint8).reshape(image.height, image.width, image.bands)
    return imgnp

Then I checked the performance in IPython because it has nice timing functions. As you can see, pyvips is around 13x faster than PIL - and that's with PIL already 2x faster than the original version thanks to avoiding the array copy:

In [49]: %timeit usingPIL('Alaska1.png')                                                            
3.66 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit usingOpenCV('Alaska1.png')                                                         
6.82 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit usingVIPS('Alaska1.png')                                                           
276 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Quick test that the results match (cast to int to avoid uint8 wraparound)
np.sum(usingVIPS('Alaska1.png').astype(int) - usingPIL('Alaska1.png').astype(int))
0
Mark Setchell

Something I think should be fast is:

  1. use gzip (or another codec) for compression
  2. store the compressed data directly in a Python module as literal bytes
  3. load the decompressed form into a numpy array directly

i.e., write a program that generates source code like:

import gzip, numpy
data = b'\x00\x01\x02\x03'
unpacked = numpy.frombuffer(gzip.decompress(data), numpy.uint8)

The packed data then ends up encoded directly into the .pyc file.
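A tiny generator for such a module might look like this sketch; the array and file name are placeholders, and note that the stdlib function is gzip.decompress:

```python
import gzip

import numpy as np

# Hypothetical stand-in for the processed image array
arr = np.arange(256, dtype=np.uint8)

packed = gzip.compress(arr.tobytes(), compresslevel=9)
module_source = (
    "import gzip, numpy\n"
    f"data = {packed!r}\n"
    "unpacked = numpy.frombuffer(gzip.decompress(data), numpy.uint8)\n"
)

# Write the generated module; importing it yields the array as imgdata.unpacked
with open('imgdata.py', 'w') as f:
    f.write(module_source)
```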

For low-entropy data, gzip decompression should be quite fast (edit: not really surprisingly, lzma is even faster, and it's still a stdlib Python module).

With your "alaska" data this approach gives the following performance on my machine

compression   source module size   bytecode size   import time
-----------   ------------------   -------------   -----------
gzip -9               26,133,461       9,458,176          1.79
lzma                  11,534,009       2,883,695          1.08

You can even distribute just the .pyc, provided you can control the Python version used; the code to load a .pyc in Python 2 was a one-liner, but is now more convoluted (apparently it was decided that loading .pyc isn't supposed to be convenient).

Note that the compilation of the module is reasonably fast (e.g. the lzma version compiles on my machine in just 0.1 seconds), but it's a pity to waste 11 MB more on disk for no real reason.

6502
  • Just to clarify, you feel that `gzip` should be faster than `bz2` or `lzma` on the decompression side? – ibrewster Jun 21 '19 at 22:27
  • @ibrewster: indeed lzma is about 1.7 times faster than gzip on my machine on the file you provided as an example. The total running time of a program loading the numpy array in memory is about 1s on my machine... – 6502 Jun 22 '19 at 06:03
  • @CharlesDuffy: I'd not be surprised if speed depends on the actual data, and for example with the provided alaska file `lzma` decompression is ~1.7 times faster. – 6502 Jun 22 '19 at 06:24