Reading UTF-8 characters from a gzip file in Python

Question

I am trying to read a gzipped file (.gz) in Python and am having some trouble.

I used the gzip module to read it, but the file is encoded as UTF-8 text, so eventually it reads an invalid character and crashes.

Does anyone know how to read gzip files encoded as UTF-8? I know there's a codecs module that can help, but I can't understand how to use it.

Thanks!

import string
import gzip
import codecs

f = gzip.open('file.gz','r')

engines = {}
line = f.readline()
while line:
    parsed = string.split(line, u'\u0001')

    #do some things...

    line = f.readline()
for en in engines:
    print(en)

Can you convert the utf-8 file to ascii then attempt to decompress that? hmm.... — whatsisname, Dec 10 '09 at 20:06
If you are getting a UnicodeDecodeError, see this related post, which shows the use of the open('errors') parameter and mentions a caveat when using the ISO-8859-1 (latin-1) encoding: https://stackoverflow.com/questions/35028683/python3-unicodedecodeerror-with-readlines-method — Trutane, Feb 27 '21 at 10:48

score 55 · Answer 1 · edited Jun 29 '21 at 10:09

This is possible since Python 3.3:

import gzip
gzip.open('file.gz', 'rt', encoding='utf-8')

Notice that gzip.open() requires you to explicitly specify text mode ('t').
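A minimal end-to-end sketch of this approach, assuming Python 3.3 or later. The file name sample.gz and the sample records are made up for illustration; the \u0001 field separator is the one used in the question.

```python
import gzip
import os
import tempfile

# Hypothetical sample data: two records with fields separated by U+0001.
records = ["caf\u00e9\u0001search", "na\u00efve\u0001lookup"]

path = os.path.join(tempfile.mkdtemp(), "sample.gz")

# Write UTF-8 text through gzip in text mode ('wt').
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(record + "\n")

# Read it back in text mode ('rt'); each line is already a str,
# so no separate decode step is needed.
parsed = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        parsed.append(line.rstrip("\n").split("\u0001"))

print(parsed)
```

Because the file object yields str lines one at a time, this also avoids reading the whole file into memory.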

edited Jun 29 '21 at 10:09

answered Nov 05 '13 at 17:20

Seppo Enarvi

score 24 · Accepted Answer · edited May 23 '17 at 11:47

I don't see why this should be so hard.

What are you doing exactly? Please explain "eventually it reads an invalid character".

It should be as simple as:

import gzip
fp = gzip.open('foo.gz')
contents = fp.read() # contents now has the uncompressed bytes of foo.gz
fp.close()
u_str = contents.decode('utf-8') # u_str is now a unicode string

EDITED

This answer works for Python 2. In Python 3, please see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open).

edited May 23 '17 at 11:47

Community

answered Dec 10 '09 at 20:11

sjbrown

+1 ... That is the most lucid and least complicated of the 3 answers so far. – John Machin Dec 10 '09 at 22:49
Not necessarily the least complicated, in that you have to decode each line you read. In the getreader implementation, this happens automatically so each line is unicode – SecurityJoe Jan 05 '12 at 20:37
While it is a nice solution, I have a feeling that this solution won't scale well with large files. – Julien Grenier Nov 09 '16 at 15:59
Exactly. We want this to be interpreted correctly by the library, not by us after being required to read the whole thing into a string. – Dustin Oprea Jan 17 '20 at 21:01
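One way to address the scaling concern while still decoding the bytes yourself is to decode in chunks with an incremental decoder. This is only a sketch: the helper name iter_decoded_chunks, the chunk size, and the file name big.gz are assumptions made for illustration, not part of any answer above.

```python
import codecs
import gzip
import os
import tempfile


def iter_decoded_chunks(path, chunk_size=64 * 1024):
    """Yield decoded text pieces without loading the whole file."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # The incremental decoder buffers a multi-byte character
            # that happens to be split across two chunks.
            yield decoder.decode(chunk)
        # Flush; raises UnicodeDecodeError if the data ends mid-character.
        yield decoder.decode(b"", final=True)


# Demonstration with a made-up file and a deliberately tiny chunk size,
# so that multi-byte characters are split across chunk boundaries.
path = os.path.join(tempfile.mkdtemp(), "big.gz")
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e\n" * 3
with gzip.open(path, "wb") as f:
    f.write(text.encode("utf-8"))

result = "".join(iter_decoded_chunks(path, chunk_size=5))
```

Calling decode() on each raw chunk directly would fail whenever a chunk boundary falls inside a multi-byte character, which is why the incremental decoder is used here.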

score 21 · Answer 3 · answered Dec 10 '09 at 20:21

Maybe

import codecs
import gzip

zf = gzip.open(fname, 'rb')  # fname: path to the .gz file
reader = codecs.getreader("utf-8")
contents = reader(zf)  # wraps the byte stream and yields decoded lines
for line in contents:
    pass
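A self-contained sketch of the codecs.getreader approach, including the errors argument of the stream reader. The file name broken.gz and its contents (one deliberately invalid byte, 0xFF) are made up for illustration.

```python
import codecs
import gzip
import os
import tempfile

# Hypothetical test file: the second line contains a byte that is
# never valid in UTF-8.
path = os.path.join(tempfile.mkdtemp(), "broken.gz")
with gzip.open(path, "wb") as f:
    f.write(b"ok\n\xffbad\n")

# codecs.getreader returns a StreamReader class; instantiate it around
# the binary gzip stream. errors='replace' substitutes U+FFFD for
# undecodable bytes instead of raising UnicodeDecodeError.
with gzip.open(path, "rb") as zf:
    reader = codecs.getreader("utf-8")(zf, errors="replace")
    lines = [line for line in reader]

print(lines)
```

With the default errors='strict', the same loop would raise UnicodeDecodeError once the reader reaches the invalid byte.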

answered Dec 10 '09 at 20:21

Jochen Ritzel

As a one-liner: for line in codecs.getreader('utf-8')(gzip.open(fname), errors='replace') which also adds control over the error handling – SecurityJoe Jan 05 '12 at 20:38

score 7 · Answer 4 · answered Aug 10 '14 at 20:13

The above produced tons of decoding errors. I used this:

import gzip
import io
for line in io.TextIOWrapper(io.BufferedReader(gzip.open(filePath)), encoding='utf8', errors='ignore'):
    ...
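Note that errors='ignore' silently drops undecodable bytes, which can hide the fact that the file is not really UTF-8. The sketch below compares the three common error policies on a made-up file (mixed.gz, with one invalid byte); the helper read_lines is hypothetical and exists only for this comparison.

```python
import gzip
import io
import os
import tempfile

# Hypothetical test file with one byte (0xFF) that is invalid in UTF-8.
path = os.path.join(tempfile.mkdtemp(), "mixed.gz")
with gzip.open(path, "wb") as f:
    f.write(b"good line\nbad \xff byte\n")


def read_lines(errors):
    # Wrap the binary gzip stream in a text layer with the given policy.
    # Closing the wrapper also closes the underlying gzip file.
    with io.TextIOWrapper(gzip.open(path, "rb"),
                          encoding="utf-8", errors=errors) as f:
        return list(f)


ignored = read_lines("ignore")    # invalid byte is dropped
replaced = read_lines("replace")  # invalid byte becomes U+FFFD

try:
    read_lines("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True          # invalid byte is reported
```

If the decoding errors are widespread rather than isolated, the file may be in a different encoding altogether, and passing the correct encoding name is usually a better fix than ignoring errors.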

answered Aug 10 '14 at 20:13

Yuri Astrakhan

score 0 · Answer 5 · answered Dec 10 '09 at 20:26

In pythonic form (2.5 or greater)

from __future__ import with_statement # for 2.5, does nothing in 2.6
from gzip import open as gzopen

with gzopen('foo.gz') as gzfile:
    for line in gzfile:
        # each line is read as bytes; decode it to text before printing
        print(line.decode('utf-8'))

answered Dec 10 '09 at 20:26

Douglas Mayle
