
This is related to the question about zip bombs, but with gzip or bzip2 compression in mind, e.g. a web service accepting .tar.gz files.

Python provides a handy tarfile module that is convenient to use, but it does not seem to provide protection against zip bombs.

In Python code using the tarfile module, what would be the most elegant way to detect zip bombs, preferably without duplicating too much logic (e.g. the transparent decompression support) from the tarfile module?

And, just to make it a bit less simple: No real files are involved; the input is a file-like object (provided by the web framework, representing the file a user uploaded).

Joachim Breitner
  • Can't you use TarInfo.size? – damiankolasa Nov 29 '12 at 10:03
  • @fatfredyy you can hit the gz bomb before you unzip the tar. – Jakozaur Nov 29 '12 at 10:08
  • What effect of the bomb are you worried about? Memory usage only? Also disk space usage when extracting (per the referenced question)? – Mark Adler Dec 25 '12 at 00:37
  • Hmm, my question got downvoted without explanation, and I don’t understand the close vote: Isn’t this about a very clear and specific programming task? – Joachim Breitner Dec 09 '13 at 20:13
  • Sigh. It seems that some people think this is a sysadmin question (and that is possible from a quick reading). So I slightly clarified this question: This is really about writing code that makes a web application gzip-bomb-safe. – Joachim Breitner Dec 11 '13 at 00:19
  • related: [gzip, bz2, lzma: add option to limit output size](https://bugs.python.org/issue15955) – jfs Dec 22 '14 at 04:11

5 Answers


You could use the resource module to limit the resources available to your process and its children.

If you need to decompress in memory, then you could set resource.RLIMIT_AS (or RLIMIT_DATA, RLIMIT_STACK), e.g. using a context manager to restore it automatically to its previous value:

import contextlib
import resource

@contextlib.contextmanager
def limit(limit, type=resource.RLIMIT_AS):
    soft_limit, hard_limit = resource.getrlimit(type)
    resource.setrlimit(type, (limit, hard_limit)) # set soft limit
    try:
        yield
    finally:
        resource.setrlimit(type, (soft_limit, hard_limit)) # restore

with limit(1 << 30): # 1GB
    # do the thing that might try to consume all memory
    pass

If the limit is reached, MemoryError is raised.
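For the upload scenario in the question (a .tar.gz arriving as a file-like object), a minimal sketch of how this might be wired into the extraction code; `uploaded`, `process` and `reject_upload` are placeholders, not part of any framework:

import tarfile

try:
    with limit(1 << 30):  # cap the address space while decompressed data is held in memory
        tar = tarfile.open(fileobj=uploaded, mode="r:gz")
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()  # in-memory extraction
                process(data)
except MemoryError:
    reject_upload()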

jfs
  • There is no reason for a properly implemented tar.gz extractor to take more than about 40K of memory, regardless of the size of the archive or the amount of uncompressed data. The amount of disk space taken when extracted is another matter, but this wouldn't help with that. – Mark Adler Dec 24 '12 at 23:12
  • @MarkAdler: The OP is only interested in the case when all data is in memory: *"No real files are involved"*, *"I have the data in memory"* – jfs Dec 24 '12 at 23:47
  • The question linked by the poster describes the zip bomb issue as _"When opened they fill the server's disk."_ So it's not clear. – Mark Adler Dec 25 '12 at 00:35
  • In any case, you should not need any appreciable memory to examine or extract a .tar.gz file, regardless of size. – Mark Adler Dec 25 '12 at 00:38
  • This is a possibility and maybe a good general precaution when handling untrusted data. The disadvantage is that it is hard to tell in advance *when* the processing fails. If that code is not well suited to rolling back after the exception, checking in advance whether it is safe would be preferable. – Joachim Breitner Dec 25 '12 at 14:14
  • You can also add `resource.RLIMIT_FSIZE` to limit "the maximum size of a file which the process may create". I don't think this will work for child processes though. – Tim Ludwinski Dec 16 '15 at 20:34
  • @TimLudwinski: Yes, you can pass whatever is appropriate in your case; that is why I made `type` a parameter instead of hardcoding it. In **this** question, all data is in memory. – jfs Dec 16 '15 at 20:38

This will determine the uncompressed size of the gzip stream, while using limited memory:

#!/usr/bin/python
import sys
import zlib
f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15+16)       # 15+16 tells zlib to expect a gzip wrapper
total = 0
while True:
    buf = z.unconsumed_tail         # compressed input left over from the last call
    if buf == "":
        buf = f.read(1024)
        if buf == "":
            break
    got = z.decompress(buf, 4096)   # the second argument caps the output per call
    if got == "":
        break
    total += len(got)
print total
if z.unused_data != "" or f.read(1024) != "":
    print "warning: more input after end of gzip stream"

It will return a slight overestimate of the space required for all of the files in the tar file when extracted. The length includes those files, as well as the tar directory information.

The gzip.py code does not control the amount of data decompressed, except by virtue of the size of the input data. In gzip.py, it reads 1024 compressed bytes at a time. So you can use gzip.py if you're ok with up to about 1056768 bytes of memory usage for the uncompressed data (1032 * 1024, where 1032:1 is the maximum compression ratio of deflate). The solution here uses zlib.decompress with the second argument, which limits the amount of uncompressed data. gzip.py does not.

This will accurately determine the total size of the extracted tar entries by decoding the tar format:

#!/usr/bin/python

import sys
import zlib

def decompn(f, z, n):
    """Return n uncompressed bytes, or fewer if at the end of the compressed
       stream.  This only decompresses as much as necessary, in order to
       avoid excessive memory usage for highly compressed input.
    """
    blk = ""
    while len(blk) < n:
        buf = z.unconsumed_tail
        if buf == "":
            buf = f.read(1024)
        got = z.decompress(buf, n - len(blk))
        blk += got
        if got == "":
            break
    return blk

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15+16)       # 15+16 tells zlib to expect a gzip wrapper
total = 0
left = 0
while True:
    blk = decompn(f, z, 512)        # tar headers and data come in 512-byte blocks
    if len(blk) < 512:
        break
    if left == 0:
        if blk == "\0"*512:         # an all-zero block marks the end of the archive
            continue
        if blk[156] in ["1", "2", "3", "4", "5", "6"]:
            continue                # links, devices, directories, FIFOs carry no data
        if ord(blk[124]) == 0x80:
            # GNU base-256 encoding for sizes that do not fit the octal field
            size = 0
            for i in range(125, 136):
                size <<= 8
                size += ord(blk[i])
        else:
            size = int(blk[124:136].split()[0].split("\0")[0], 8)
        if blk[156] not in ["x", "g", "X", "L", "K"]:
            total += size           # extended headers are skipped but not counted
        left = (size + 511) // 512  # number of 512-byte data blocks to skip
    else:
        left -= 1
print total
if blk != "":
    print "warning: partial final block"
if left != 0:
    print "warning: tar file ended in the middle of an entry"
if z.unused_data != "" or f.read(1024) != "":
    print "warning: more input after end of gzip stream"

You could use a variant of this to scan the tar file for bombs. This has the advantage of finding a large size in the header information before you even have to decompress that data.
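In the web-upload setting this scan can serve as a cheap first pass over the same file-like object before handing it to tarfile. A sketch, assuming the loop above has been wrapped into a function scan_tar_gz_size(fileobj) that returns total; MAX_UNCOMPRESSED and reject_upload are hypothetical:

import tarfile

MAX_UNCOMPRESSED = 100 * 1024 * 1024   # example policy limit: 100 MB

uploaded.seek(0)
if scan_tar_gz_size(uploaded) > MAX_UNCOMPRESSED:
    reject_upload()                    # hypothetical handler
else:
    uploaded.seek(0)                   # rewind and let tarfile do the real work
    tar = tarfile.open(fileobj=uploaded, mode="r:gz")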

As for .tar.bz2 archives, the Python bz2 library (at least as of 3.3) is unavoidably unsafe against bz2 bombs that consume too much memory. The bz2.decompress function does not offer a second argument like zlib.decompress does. This is made even worse by the fact that the bz2 format has a much, much higher maximum compression ratio than zlib due to run-length coding. bzip2 compresses 1 GB of zeros to 722 bytes. So you cannot meter the output of bz2.decompress by metering the input, as can be done with zlib.decompress even without the second argument. The lack of a limit on the decompressed output size is a fundamental flaw in the Python interface.

I looked at _bz2module.c in 3.3 to see if there is an undocumented way to use it to avoid this problem. There is no way around it. The decompress function there just keeps growing the result buffer until it can decompress all of the provided input. _bz2module.c needs to be fixed.
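The issue linked in the question comments tracks exactly this limitation. As far as I know, Python 3.5 later added a max_length argument to bz2.BZ2Decompressor.decompress (along with a needs_input attribute), which makes a bounded loop in the style of decompn above possible on newer interpreters. A sketch; the function name and cap are arbitrary:

import bz2

def bounded_bz2_size(fileobj, cap, chunk=8192):
    """Sum the decompressed size of a bz2 stream without ever holding more
    than roughly `chunk` decompressed bytes at a time; raise once `cap` is
    exceeded.  Requires the max_length argument added in Python 3.5."""
    d = bz2.BZ2Decompressor()
    total = 0
    while not d.eof:
        data = fileobj.read(chunk) if d.needs_input else b""
        if d.needs_input and not data:
            raise ValueError("truncated bz2 stream")
        out = d.decompress(data, max_length=chunk)   # output capped per call
        total += len(out)
        if total > cap:
            raise ValueError("decompressed data exceeds cap")
    return total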

Mark Adler
  • Are you sure this works? How will tar know about the size without uncompressing the gzip wrapper around it? Note that I am worried about the gzip bomb, not the tar bomb! – Joachim Breitner Dec 23 '12 at 23:12
  • I tested it, and it does not work: Packing a 10GB file of zeroes into a `.tar.gz` file results in a 10MB file. Running your code on that file with `ulimit -v 200000` set fails, so it uses much more than the 10MB of input and hence is susceptible to zip bombs. – Joachim Breitner Dec 24 '12 at 12:28
  • Ok, that might work, as `z.decompress` is safe, but that would not work for bzip2 (unless I am missing something), and it cannot easily be adapted due to shortcomings of the bzip2 library API. Also, it seems to be more complicated and error-prone than the code in my solution. – Joachim Breitner Dec 25 '12 at 14:12
  • The gzip.py code does not control the amount of data decompressed, except by virtue of the size of the input data. In gzip.py, it reads 1024 compressed bytes at a time. So your method will work if you're ok with up to about 1056768 bytes of memory usage for the uncompressed data (1032 * 1024, where 1032:1 is the maximum compression ratio of deflate). My solution uses zlib.decompress with the second argument, which limits the amount of uncompressed data. gzip.py does not. – Mark Adler Dec 25 '12 at 17:17
  • You are correct about there being no memory-safe way use the Python bz2 library. The bz2.decompress function does not offer a second argument like zlib.decompress does. This is made even worse by the fact that the bz2 format has a much, much higher maximum compression ratio than zlib due to run-length coding. bzip2 compresses 1 GB of zeros to 722 bytes. So you cannot meter the output of bz2.decompress by metering the input as can be done with zlib.decompress even without the second argument. The lack of a limit on the decompressed output size is a fundamental flaw in the Python interface. – Mark Adler Dec 25 '12 at 17:22
  • Just verified by looking at _bz2module.c in 3.3. There is no way around it. The decompress function in there just keeps growing the result buffer until it can decompress all of the provided input. _bz2module.c needs to be fixed. – Mark Adler Dec 25 '12 at 17:43
  • Well, your initial answer did not work – isn’t that a reason for a -1? However, your investigation about the Python bz2 library is valuable information; why not move that from the comment to the answer? – Joachim Breitner Dec 25 '12 at 18:24

If you develop for Linux, you can run the decompression in a separate process and use ulimit to limit its memory usage.

import subprocess
# ulimit -v takes the limit in kilobytes; shell=True is needed to run the compound command
subprocess.Popen("ulimit -v %d; ./decompression_script.py %s" % (LIMIT, FILE),
                 shell=True)

Keep in mind that decompression_script.py should decompress the whole file in memory, before writing to disk.
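An alternative to spawning a shell just for ulimit is to set the limit directly in the child process via preexec_fn and the resource module. A minimal sketch; the script name, input path, and the 1 GB cap are placeholders:

import resource
import subprocess

LIMIT_BYTES = 1 << 30   # cap the child's address space at 1 GB

def _limit_child():
    # runs in the forked child just before exec()
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

proc = subprocess.Popen(["./decompression_script.py", "/tmp/upload.tar.gz"],
                        preexec_fn=_limit_child)
proc.wait()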

Jakozaur

I guess the answer is: there is no easy, ready-made solution. Here is what I use now:

class SafeUncompressor(object):
    """Small proxy class that enables external file object
    support for uncompressed, bzip2 and gzip files. Works transparently, and
    supports a maximum size to avoid zipbombs.
    """
    blocksize = 16 * 1024

    class FileTooLarge(Exception):
        pass

    def __init__(self, fileobj, maxsize=10*1024*1024):
        self.fileobj = fileobj
        self.name = getattr(self.fileobj, "name", None)
        self.maxsize = maxsize
        self.init()

    def init(self):
        import bz2
        import gzip
        self.pos = 0
        self.fileobj.seek(0)
        self.buf = ""
        self.format = "plain"

        magic = self.fileobj.read(2)
        if magic == '\037\213':
            self.format = "gzip"
            self.gzipobj = gzip.GzipFile(fileobj = self.fileobj, mode = 'r')
        elif magic == 'BZ':
            raise IOError, "bzip2 support in SafeUncompressor disabled, as self.bz2obj.decompress is not safe"
            self.format = "bz2"
            self.bz2obj = bz2.BZ2Decompressor()
        self.fileobj.seek(0)


    def read(self, size):
        b = [self.buf]
        x = len(self.buf)
        while x < size:
            if self.format == 'gzip':
                data = self.gzipobj.read(self.blocksize)
                if not data:
                    break
            elif self.format == 'bz2':
                raw = self.fileobj.read(self.blocksize)
                if not raw:
                    break
                # this can already bomb here, to some extent.
                # so disable bzip support until resolved.
                # Also monitor http://stackoverflow.com/questions/13622706/how-to-protect-myself-from-a-gzip-or-bzip2-bomb for ideas
                data = self.bz2obj.decompress(raw)
            else:
                data = self.fileobj.read(self.blocksize)
                if not data:
                    break
            b.append(data)
            x += len(data)

            if self.pos + x > self.maxsize:
                self.buf = ""
                self.pos = 0
                raise SafeUncompressor.FileTooLarge("Compressed file too large")
        self.buf = "".join(b)

        buf = self.buf[:size]
        self.buf = self.buf[size:]
        self.pos += len(buf)
        return buf

    def seek(self, pos, whence=0):
        if whence != 0:
            raise IOError, "SafeUncompressor only supports whence=0"
        if pos < self.pos:
            self.init()
        self.read(pos - self.pos)

    def tell(self):
        return self.pos

It does not work well for bzip2, so that part of the code is disabled. The reason is that bz2.BZ2Decompressor.decompress can already produce an arbitrarily large chunk of data from a small compressed input.
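With the proxy in place, usage could look roughly like this; `uploaded` stands for the web framework's file object, the 10 MB limit is arbitrary, and `process`/`reject_upload` are hypothetical handlers:

import tarfile

try:
    safe = SafeUncompressor(uploaded, maxsize=10*1024*1024)
    tar = tarfile.open(fileobj=safe, mode="r:")   # the proxy already decompresses
    for member in tar.getmembers():
        if member.isfile():
            process(tar.extractfile(member).read())
except SafeUncompressor.FileTooLarge:
    reject_upload()

Since SafeUncompressor already performs the decompression, tarfile is opened in plain (uncompressed) mode.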

Joachim Breitner

I also need to handle zip bombs in uploaded zipfiles.

I do this by creating a fixed size tmpfs, and unzipping to that. If the extracted data is too large then the tmpfs will run out of space and give an error.

Here are the Linux commands to create a 200M tmpfs to unzip into.

sudo mkdir -p /mnt/ziptmpfs
echo 'tmpfs   /mnt/ziptmpfs         tmpfs   rw,nodev,nosuid,size=200M          0  0' | sudo tee -a /etc/fstab
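The archive can then be extracted into that mount point (after `sudo mount /mnt/ziptmpfs`, or a reboot, so the fstab entry takes effect); if a bomb fills the tmpfs, the write fails with ENOSPC and the upload can be rejected. A sketch with placeholder paths and handlers:

import errno
import tarfile

try:
    tar = tarfile.open(fileobj=uploaded, mode="r:gz")
    tar.extractall("/mnt/ziptmpfs/job-1234")    # hypothetical per-upload directory
except (IOError, OSError) as e:
    if e.errno == errno.ENOSPC:                 # tmpfs full: treat the upload as a bomb
        reject_upload()
    else:
        raise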
Duke Dougal