97

How do I gzip compress a string in Python?

gzip.GzipFile exists, but that's for file objects - what about with plain strings?

Kristian Glass
Bdfy
  • 1
    @KevinDTimm, that docu only mentions `StringIO` but does not really explain how to do it. So asking that question here is completely valid, IMHO. Some more trials before asking and telling us about them would have been nice, though. – Alfe Jun 04 '15 at 08:45
  • @Alfe - the question was closed 4 years ago for much the same reason as my comment - the OP made no effort to search first. – KevinDTimm Jun 04 '15 at 13:05
  • Of course you are right, @KevinDTimm. – Alfe Jun 06 '15 at 23:43
  • 4
    How is this off-topic? –  Jun 11 '16 at 21:39
  • 2
    This question is the top hit in google now for `gzip string in python` and is very reasonable IMO. It should be re-opened. – Garrett Dec 12 '16 at 23:39
  • 2
    As above, this question is the top result in a google search, and one of the answers is correct - it really seems as though it shouldn't be closed. – darkdan21 Jan 29 '18 at 17:01

6 Answers

166

If you want to produce a complete gzip-compatible binary string, with the header etc., you could use gzip.GzipFile together with an in-memory buffer (StringIO on Python 2.7, io.BytesIO on Python 3):

try:
    from StringIO import StringIO as BytesIO  # Python 2.7
except ImportError:
    from io import BytesIO  # Python 3.x: GzipFile writes bytes, so io.StringIO would fail
import gzip

out = BytesIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    # GzipFile expects bytes; a b"..." literal is a plain str on Python 2.7
    f.write(b"This is mike number one, isn't this a lot of fun?")
out.getvalue()

# returns '\x1f\x8b\x08\x00\xbd\xbe\xe8N\x02\xff\x0b\xc9\xc8,V\x00\xa2\xdc\xcc\xecT\x85\xbc\xd2\xdc\xa4\xd4"\x85\xfc\xbcT\x1d\xa0X\x9ez\x89B\tH:Q!\'\xbfD!?M!\xad4\xcf\x1e\x00w\xd4\xea\xf41\x00\x00\x00'
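
For the reverse direction, here is a minimal Python 3 sketch of the full round trip with GzipFile and BytesIO. The helper names gzip_text and gunzip_text are hypothetical, chosen only for illustration, and UTF-8 is an assumed encoding.

```python
import gzip
from io import BytesIO


def gzip_text(text, encoding="utf-8"):
    """Compress a str into gzip-format bytes, entirely in memory."""
    buf = BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(text.encode(encoding))
    # Read the buffer only after the with-block: closing writes the gzip trailer.
    return buf.getvalue()


def gunzip_text(data, encoding="utf-8"):
    """Decompress gzip-format bytes back into a str."""
    with gzip.GzipFile(fileobj=BytesIO(data), mode="rb") as f:
        return f.read().decode(encoding)


original = "This is mike number one, isn't this a lot of fun?"
compressed = gzip_text(original)
restored = gunzip_text(compressed)
print(restored == original)
```

The exact compressed bytes differ from run to run because the gzip header embeds a timestamp, so compare the decompressed text rather than the raw output.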
Martin Thoma
NPE
  • 2
    The opposite of this is: `def gunzip_text(text): infile = StringIO.StringIO() infile.write(text) with gzip.GzipFile(fileobj=infile, mode="r") as f: f.rewind() return f.read()` – fastmultiplication Apr 24 '14 at 13:07
  • 4
    @fastmultiplication: or shorter: `f = gzip.GzipFile(fileobj=StringIO.StringIO(text)); result = f.read(); f.close(); return result` – Alfe Jun 04 '15 at 08:22
  • 3
    Unfortunately, the question has been closed, so I can't post a new answer, but [here](https://gist.github.com/Garrett-R/dc6f08fc1eab63f94d2cbb89cb61c33d) is how to do this in Python 3. – Garrett Dec 12 '16 at 23:40
  • Probably unrelated: is compressing in memory first faster (compared with using local disk)? – user3226167 Sep 05 '17 at 08:17
  • 1
    In Python 3: `import zlib; my_string = "hello world"; my_bytes = zlib.compress(my_string.encode('utf-8')); my_hex = my_bytes.hex(); my_bytes2 = bytes.fromhex(my_hex); my_string2 = zlib.decompress(my_bytes2).decode('utf-8'); assert my_string == my_string2;` – ostrokach Dec 15 '17 at 17:02
  • copying and pasting this into 3.7 iPython fails with `TypeError: string argument expected, got 'bytes'` – Chazt3n Nov 14 '22 at 18:27
72

The easiest way in Python 2 is the zlib codec (note that it produces a raw zlib stream, without the gzip header and checksum):

compressed_value = s.encode("zlib")

Then you decompress it with:

plain_string_again = compressed_value.decode("zlib")
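
On Python 3 the same idea needs bytes rather than str. The sketch below assumes Python 3.3 or later and UTF-8 text. It shows the `codecs` spelling of the zlib codec, and then uses `zlib.compressobj` with `wbits = zlib.MAX_WBITS | 16`, which asks zlib to wrap the stream in a gzip header and trailer instead.

```python
import codecs
import gzip
import zlib

data = "a long string of characters".encode("utf-8")

# Python 3 spelling of s.encode("zlib"): the codec maps bytes to bytes.
zlib_bytes = codecs.encode(data, "zlib")
plain_again = codecs.decode(zlib_bytes, "zlib")

# Adding 16 to the window size makes zlib emit the gzip container format.
compressor = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)
gzip_bytes = compressor.compress(data) + compressor.flush()

# The two streams start with different header bytes.
print(zlib_bytes[:2].hex())
print(gzip_bytes[:2].hex())

# Only the second stream is readable by the gzip module.
print(gzip.decompress(gzip_bytes) == data)
```

The first stream is what `zlib.decompress` expects; only the second one should be accepted by tools that read the gzip format.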
minillinim
Sven Marnach
  • 1
    @Daniel: Yes, `s` is a Python 2.x object of type `str`. – Sven Marnach May 23 '12 at 09:10
  • 2
    See [Standard Encodings](http://docs.python.org/2/library/codecs.html#standard-encodings) for where he got that (scroll down to __"codecs"__). Also available: `s.encode('rot13')`, `s.encode( 'base64' )` – bobobobo Dec 19 '12 at 21:35
  • 12
    Note that this method is incompatible with the gzip command-line utility in that gzip includes a header and checksum, while this mechanism simply compresses the content. – tylerl Dec 29 '13 at 00:23
  • I know this is old, but your line of code for decompressing should be: `plain_string_again = compressed_value.decode("zlib")` – minillinim Jul 08 '14 at 09:55
  • @minillinim: Yes, someone added this slightly wrong code to my answer. Feel free to fix it -- it doesn't matter that it's old. – Sven Marnach Jul 08 '14 at 11:06
  • 9
    @BenjaminToueg: Python 3 is stricter about the distinction between Unicode strings (type `str` in Python 3) and byte strings (type `bytes`). `str` objects have an `encode()` method that returns a `bytes` object, and `bytes` objects have a `decode()` method that returns a `str`. The `zlib` codec is special in that it converts from `bytes` to `bytes`, so it doesn't fit into this structure. You can use `codecs.encode(b, "zlib")` and `codecs.decode(b, "zlib")` for a `bytes` object `b` instead. – Sven Marnach Nov 27 '14 at 12:47
  • How can I direct it into a file? – alper Jan 07 '21 at 14:07
  • 1
    Beware. This answer is wrong. It does _not_ compress to the gzip format, as asked in the question. – Mark Adler Jun 04 '22 at 14:30
42

Python 3 version of Sven Marnach's 2011 answer:

import gzip
exampleString = 'abcdefghijklmnopqrstuv' * 150 + '123'  # long, highly repetitive sample text
compressed_value = gzip.compress(bytes(exampleString, 'utf-8'))
plain_string_again = gzip.decompress(compressed_value).decode('utf-8')
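
To check that the result really is in gzip format and to see how much it shrinks, something like the following sketch could be used. The sample text and variable names are illustrative assumptions; the two-byte magic number `1f 8b` comes from the gzip specification (RFC 1952).

```python
import gzip

text = "abcdefghijklmnopqrstuv" * 100  # hypothetical repetitive sample text
raw = text.encode("utf-8")

# compresslevel ranges from 0 (no compression) to 9 (smallest output).
compressed = gzip.compress(raw, compresslevel=9)
restored = gzip.decompress(compressed).decode("utf-8")

is_gzip = compressed[:2] == b"\x1f\x8b"  # gzip magic number
print(len(raw), len(compressed), is_gzip)
```

Repetitive input like this compresses very well; real text will usually shrink less.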
Punnerud
  • 3
    In Python 3 `zlib` is still used, `gzip` actually uses `zlib`, see: https://docs.python.org/3/library/zlib.html and https://docs.python.org/3/library/gzip.html#module-gzip – gitaarik Apr 10 '19 at 09:45
  • My original answer used zlib. I changed it to gzip because that was what the question asked for. You can easily replace gzip with zlib (search-and-replace) in my example, and it will still work. – Punnerud Apr 10 '19 at 19:13
  • 1
    `gzip.decompress` returns bytes, so call `plain_string_again.decode('utf-8')` to get a str object – milan May 05 '22 at 18:09
  • _Unlike_ Sven Marnach's answer, this answer is correct, in that it produces the gzip format. – Mark Adler Jun 04 '22 at 19:58
3

For those who want to compress a Pandas dataframe in JSON format:

Tested with Python 3.6 and Pandas 0.23

import sys
import zlib, lzma, bz2
import math
import pandas as pd

def convert_size(size_bytes):
    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])

dataframe = pd.read_csv('...') # your CSV file
dataframe_json = dataframe.to_json(orient='split')
data = dataframe_json.encode()
compressed_data = bz2.compress(data)
decompressed_data = bz2.decompress(compressed_data).decode()
dataframe_aux = pd.read_json(decompressed_data, orient='split')

#Original data size:  10982455 10.47 MB
#Encoded data size:  10982439 10.47 MB
#Compressed data size:  1276457 1.22 MB (lzma, slow), 2087131 1.99 MB (zlib, fast), 1410908 1.35 MB (bz2, fast)
#Decompressed data size:  10982455 10.47 MB
print('Original data size: ', sys.getsizeof(dataframe_json), convert_size(sys.getsizeof(dataframe_json)))
print('Encoded data size: ', sys.getsizeof(data), convert_size(sys.getsizeof(data)))
print('Compressed data size: ', sys.getsizeof(compressed_data), convert_size(sys.getsizeof(compressed_data)))
print('Decompressed data size: ', sys.getsizeof(decompressed_data), convert_size(sys.getsizeof(decompressed_data)))

print(dataframe.head())
print(dataframe_aux.head())
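
The size comparison quoted in the comments above can be reproduced without pandas. This is only a sketch: the JSON payload is a made-up stand-in for `dataframe.to_json(orient='split')`, and the resulting sizes will differ from the figures above. It uses `len()` on the compressed bytes, which counts the payload only, whereas `sys.getsizeof` also includes Python's per-object overhead.

```python
import bz2
import json
import lzma
import zlib

# Hypothetical stand-in for dataframe.to_json(orient='split').
records = {
    "columns": ["id", "name"],
    "data": [[i, "row %d" % i] for i in range(5000)],
}
data = json.dumps(records).encode("utf-8")

codecs_to_try = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs_to_try.items():
    packed = compress(data)
    assert decompress(packed) == data  # each codec must round-trip losslessly
    sizes[name] = len(packed)  # payload length, without object overhead

print("original:", len(data))
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(name, size)
```

Which codec wins depends on the data, so it is worth measuring on a sample of your own frame before settling on one.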
1

Martin Thoma's answer almost worked: I had to use BytesIO as mentioned in this answer.

from io import BytesIO # Python 3.x, haven't tested 2.7
import gzip
out = BytesIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    # GzipFile.write() needs bytes in Python 3, so use a bytes literal
    f.write(b"This is mike number one, isn't this a lot of fun?")
out.getvalue()

The original code produced a TypeError: string argument expected, got 'bytes'

-4
import gzip

s = "a long string of characters"

g = gzip.open('gzipfilename.gz', 'wb', 5)  # (filename, read/write mode, compression level)
g.write(s.encode('utf-8'))  # binary mode expects bytes, so encode the string first
g.close()
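
On Python 3, `gzip.open` also accepts text modes, which avoids the manual encode step. A minimal sketch, assuming Python 3.3 or later; the temporary directory is just a placeholder location for the example file.

```python
import gzip
import os
import tempfile

s = "a long string of characters"

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "gzipfilename.gz")

# "wt" opens the file in text mode, so the str is encoded on the way in.
with gzip.open(path, "wt", compresslevel=5, encoding="utf-8") as g:
    g.write(s)

# "rt" decodes the decompressed bytes back into a str.
with gzip.open(path, "rt", encoding="utf-8") as g:
    read_back = g.read()

print(read_back == s)
```

The resulting file is an ordinary .gz file, so it should also be readable with the gzip command-line utility.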
Nakilon
Jon Mitten
  • 6
    I guess the question was about compressing a string in memory without having to write it to disk in the process. Otherwise your answer is totally correct. – Alfe Jun 04 '15 at 08:42