I'm implementing a tool that parses a huge set of files, totalling 248 GB, compressed in bz2 format. The average compression factor is 0.04, so decompressing them beforehand to over 6 terabytes is out of the question.
Each line of the content files is a complete JSON record, so I'm opening the files with the bz2 module's open and reading them with a for line in bz2file loop, and it works nicely. The problem is I have no idea how to show any measure of progress, because I don't know how many compressed bytes I've read nor how many records there are in each file. The files are just huge; some are up to 24 GB.
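
For reference, the reading loop is essentially this (a minimal sketch; the filename and the per-record processing are placeholders):

```python
import bz2
import json

# "records.json.bz2" is a placeholder for one of the actual input files.
with bz2.open("records.json.bz2", mode="rt") as bzfile:
    for line in bzfile:
        record = json.loads(line)  # each decompressed line is one JSON record
        # ... process the record ...
```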
How would you approach this?