I'm implementing a tool that parses a huge set of bz2-compressed files totalling 248 GB. The average compression factor is 0.04, so decompressing them beforehand, to over 6 TB, is out of the question.

Each line of the content files is a complete JSON record, so I'm reading the files with the bz2 module's open() and a for line in bz2file loop, and it works nicely. The problem is that I have no idea how to show any measure of progress, because I don't know how many compressed bytes I've read or how many records there are in each file. The files are just huge; some are up to 24 GB.
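
Here's roughly what the reading loop looks like (the filename is a placeholder):

import bz2
import json

# "records.bz2" stands in for one of the actual files; each decompressed
# line is one complete JSON record.
with bz2.open("records.bz2", "rt") as bz2file:
    for line in bz2file:
        record = json.loads(line)
        ...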

How would you approach this?

1 Answer

Naive method

You could use tqdm like so:

import bz2

from tqdm import tqdm

with bz2.open("hugefile.bz2", "rt") as bz2file:
    for line in tqdm(bz2file, desc="hugefile"):
        ...

This way you will know how many lines you've processed and how long it took. If you want a percentage of overall progress, though, you need to know beforehand how many lines the file contains.
If you don't, you can compute it like this:

import bz2

from tqdm import tqdm

# First pass: count the lines so tqdm can show a percentage.
total = 0
with bz2.open("hugefile.bz2", "rt") as bz2file:
    for line in bz2file:
        total += 1

# Second pass: do the real work with a known total.
with bz2.open("hugefile.bz2", "rt") as bz2file:
    for line in tqdm(bz2file, desc="hugefile", total=total):
        ...

But this means going over the file twice, so you might not want to do it.

Bytes method

Another method would be to figure out how many bytes each line you read takes up, using this: https://stackoverflow.com/a/30686735/8915326

And combine it with the total file size:

import bz2
import os

from tqdm import tqdm

hugefile = "hugefile.bz2"
with bz2.open(hugefile, "rt") as bz2file:
    with tqdm(desc=hugefile, total=os.path.getsize(hugefile)) as pbar:
        for line in bz2file:
            ...
            # Size of the decompressed line in bytes.
            linesize = len(line.encode("utf-8"))
            pbar.update(linesize)

This way you're not going over the file twice, but you still have to figure out how many bytes each line is.
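
A further variant (just a sketch, untested at this scale): bz2.open also accepts an already-open file object, so you can keep a handle on the underlying binary file and poll its tell() for the position in the compressed stream. That number is directly comparable to the compressed file size, so the percentage is meaningful without a counting pass. BZ2File reads ahead in chunks, so the position advances in jumps, but over gigabytes that hardly matters.

import bz2
import os

from tqdm import tqdm

hugefile = "hugefile.bz2"
with open(hugefile, "rb") as raw, bz2.open(raw, "rt") as bz2file:
    with tqdm(desc=hugefile, total=os.path.getsize(hugefile),
              unit="B", unit_scale=True) as pbar:
        for line in bz2file:
            ...
            # raw.tell() is the offset in the *compressed* stream, so it is
            # directly comparable to os.path.getsize(hugefile).
            pbar.update(raw.tell() - pbar.n)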

Inspi
  • Running through the file twice kind of defeats the purpose of knowing how far through the file I've gotten. And the bytes method: how is the line size in bytes supposed to give any useful information if the file is bzipped? – Fernando D'Andrea Jun 10 '21 at 20:14
  • @FernandoD'Andrea I agree on the first solution, reading the file twice is kinda bad. The second one only reads it once, and it will tell you where you are in unzipping the file. We don't care that we're getting the zipped line size because we're comparing it against the zipped file size. So you won't know how big the resulting file is in advance, but you will know approximately how much time is left until you complete the process. – Inspi Jun 12 '21 at 03:56
  • But what good is comparing the uncompressed line size to the compressed file size? I mean... the bz2 module wraps the operation completely. I have no access to the zipped file pointer. Could I try checking the content's full size through bz2? – Fernando D'Andrea Jun 12 '21 at 04:04
  • Just confirming: I can't get the content size through Python without fully decompressing it. – Fernando D'Andrea Jun 12 '21 at 04:06