Extract gzip file without BOM in Python 3.6

Question

I have multiple .gz files in subfolders that I want to decompress into a single folder. That works fine, but each file starts with a BOM signature that I would like to remove. I have checked other questions such as Removing BOM from gzip'ed CSV in Python and Convert UTF-8 with BOM to UTF-8 with no BOM in Python, but their solutions don't seem to work for me. I use Python 3.6 in PyCharm on Windows.

Here is my code first, without any attempt at removing the BOM:

import gzip
import pickle
import glob


def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)


output_path = 'path_out'

i = 1

for filename in glob.iglob(
        'path_in/**/*.gz', recursive=True):
    print(filename)
    with gzip.open(filename, 'rb') as f:
        file_content = f.read()
    new_file = output_path + "z" + str(i) + ".txt"
    save_object(file_content, new_file)
    f.close()
    i += 1

Now, following the logic in Removing BOM from gzip'ed CSV in Python (at least as I understand it), if I replace file_content = f.read() with file_content = csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines()), I get:

TypeError: can't pickle _csv.reader objects

I checked for this error (e.g. "Can't pickle <type '_csv.reader'>" error when using multiprocessing on Windows) but I found no solution I could apply.

"Doesn't seem to work" how exactly? Your current code doesn't seem to include any attempt. — tripleee, Mar 07 '18 at 08:51
As I tried several of the proposed solutions, I think it is easier to show clean code to get feedback. — Chris, Mar 07 '18 at 09:38
That's exactly the problem -- show us *precisely* what you tried and how it failed. — tripleee, Mar 07 '18 at 09:41
If your input is not CSV you should not be using `csv.reader()` on the text data you have just successfully converted. The attempt to `pickle` it is perhaps indicative of a more fundamental misunderstanding. — tripleee, Mar 07 '18 at 13:12
Possible duplicate of [Convert UTF-8 with BOM to UTF-8 with no BOM in Python](https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python) — tripleee, Mar 07 '18 at 15:03

tripleee · Accepted Answer · 2018-03-07T13:14:57.413

A minor adaptation of the very first question you link to trivially works.

tripleee$ cat bomgz.py
import gzip
from subprocess import run

with open('bom.txt', 'w') as handle:
    handle.write('\ufeffmoo!\n')

run(['gzip', 'bom.txt'])

with gzip.open('bom.txt.gz', 'rb') as f:
    file_content = f.read().decode('utf-8-sig')
with open('nobom.txt', 'w') as output:
    output.write(file_content)

tripleee$ python3 bomgz.py

tripleee$ gzip -dc bom.txt.gz | xxd
00000000: efbb bf6d 6f6f 210a                      ...moo!.

tripleee$ xxd nobom.txt
00000000: 6d6f 6f21 0a                             moo!.
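
For readers without xxd (for example on Windows, as in the question), the same check can be sketched in Python. The helper name starts_with_bom is hypothetical and not part of the original answer; it only compares the first three bytes against codecs.BOM_UTF8.

```python
import codecs
import gzip


def starts_with_bom(path):
    """Return True if the (optionally gzip-compressed) file starts with a UTF-8 BOM."""
    # Hypothetical helper: choose gzip.open for .gz files, plain open otherwise.
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as f:
        # codecs.BOM_UTF8 is the three bytes EF BB BF seen in the xxd output.
        return f.read(3) == codecs.BOM_UTF8
```

With the files from the session above, starts_with_bom('bom.txt.gz') should be true and starts_with_bom('nobom.txt') should be false.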

The pickle parts didn't seem relevant here but might have been obscuring the goal of getting a block of decoded str out of an encoded blob of bytes.
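
Applied to the loop from the question, the same idea might look like the sketch below. It assumes the inputs are UTF-8 text; the function name extract_without_bom is hypothetical, and the z1.txt, z2.txt, ... output names simply mirror the question's numbering. Passing encoding='utf-8' explicitly avoids depending on the Windows default encoding, and newline='' leaves the original line endings untouched.

```python
import glob
import gzip
import os


def extract_without_bom(input_pattern, output_dir):
    """Decompress each matching .gz file into output_dir as BOM-free UTF-8 text."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    matches = sorted(glob.iglob(input_pattern, recursive=True))
    for i, filename in enumerate(matches, start=1):
        with gzip.open(filename, 'rb') as f:
            # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
            text = f.read().decode('utf-8-sig')
        new_file = os.path.join(output_dir, 'z' + str(i) + '.txt')
        # Write plain text instead of pickling; plain 'utf-8' never adds a BOM.
        with open(new_file, 'w', encoding='utf-8', newline='') as output:
            output.write(text)
        written.append(new_file)
    return written
```

A call such as extract_without_bom('path_in/**/*.gz', 'path_out') would then replace the loop and the save_object helper from the question.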

Ok, I took a while to understand your response as I'm not a Python expert. The 2 last "with" operations to read and write work in my code. There's now no BOM inside. Thx! — Chris, Mar 07 '18 at 14:55
