Reading first lines of bz2 files in python

Question

I am trying to extract the first 10,000 lines from a bz2 file.

   import bz2       
   file = "file.bz2"
   file_10000 = "file.txt"

   output_file = codecs.open(file_10000,'w+','utf-8')

   source_file = bz2.open(file, "r")
   count = 0
   for line in source_file:
       count += 1
       if count < 10000:
           output_file.writerow(line)

But I get the error "'module' object has no attribute 'open'". Do you have any ideas? Or maybe I could save the first 10,000 lines to a txt file in some other way? I am on Windows.

Which version of python are you using? `bz2.open` is in python 3 not python 2. Try `bz2.BZ2File` instead. — tdelaney, May 11 '16 at 20:33
I have python 2.7, with `bz2.BZ2File` I get the same error message — student, May 11 '16 at 20:39
That's not possible. How do you use BZ2File and what error do you get? — tjollans, May 11 '16 at 20:41
`source_file = bz2.BZ2File(file, "r")` I've just replaced `open` with `BZ2File` — student, May 11 '16 at 20:43
I'm as puzzled as @tjollans: since you replaced `open` with `BZ2File`, you shouldn't get an error about "open", unless it's in a different part of your code. Does my solution work for you? If not, can you add the full stack trace to your question? — tdelaney, May 11 '16 at 21:06
Right after `import bz2` you could add `bz2.BZ2File` just to see if your module is corrupted. — tdelaney, May 11 '16 at 21:07
Sorry, I meant that I get the same kind of error, "'module' object has no attribute", but of course this time for BZ2File: "'module' object has no attribute 'BZ2File'" — student, May 12 '16 at 09:59
If I write `from bz2 import BZ2File` I get the error "cannot import name BZ2File" — student, May 12 '16 at 10:00
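
The two errors reported in the comments ("'module' object has no attribute 'BZ2File'" and "cannot import name BZ2File") suggest that `import bz2` may not be loading the standard-library module at all. One possible cause is a local file named `bz2.py` shadowing it. That is an assumption, since nothing in the thread confirms it. A minimal diagnostic sketch along the lines of tdelaney's suggestion (the helper name `describe_bz2_module` is hypothetical):

```python
import bz2
import os


def describe_bz2_module():
    """Return (path, has_BZ2File) for whatever `import bz2` actually loaded."""
    # getattr guards against modules that carry no __file__ attribute.
    path = getattr(bz2, "__file__", None)
    return path, hasattr(bz2, "BZ2File")


path, has_bz2file = describe_bz2_module()
print("bz2 loaded from:", path)
print("has BZ2File:", has_bz2file)

# Assumption: a bz2 module sitting in the current working directory is
# probably a stray script rather than the standard library.
if path and os.path.dirname(os.path.abspath(path)) == os.getcwd():
    print("Warning: bz2 was imported from the current directory; "
          "a local bz2.py may be shadowing the standard module.")
```

If the printed path points at your own project rather than the Python installation, renaming that file (and deleting any stale `bz2.pyc` next to it) would be the first thing to try.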

score 14 · Accepted Answer · answered May 11 '16 at 20:55

Here is a fully working example that writes and then reads a test file much smaller than your 10,000 lines. It's nice to have working examples in questions so we can test easily.

import bz2
import itertools
import codecs

file = "file.bz2"
file_10000 = "file.txt"

# write test file with 9 lines
with bz2.BZ2File(file, "w") as fp:
    fp.write('\n'.join('123456789'))

# the original script using BZ2File ... and 3 lines for test
# ...and fixing bugs:
#     1) it only writes 9999 instead of 10000
#     2) files don't do writerow
#     3) close the files

output_file = codecs.open(file_10000,'w+','utf-8')

source_file = bz2.BZ2File(file, "r")
count = 0
for line in source_file:
    count += 1
    if count <= 3:
        output_file.write(line)
source_file.close()
output_file.close()

# show what you got
print('---- Test 1 ----')
print(repr(open(file_10000).read()))

A more efficient way to do it is to break out of the for loop after reading the lines you want. You can even leverage iterators to slim down the code like so:

# a faster way to read first 3 lines
with bz2.BZ2File(file) as source_file,\
        codecs.open(file_10000,'w+','utf-8') as output_file:
    output_file.writelines(itertools.islice(source_file, 3))

# show what you got
print('---- Test 2 ----')
print(repr(open(file_10000).read()))
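
For readers on Python 3, where `bz2.open` exists (as noted in the comments on the question), the same `islice` idea can be sketched as below. This is only an illustration under stated assumptions: the function name `copy_first_lines`, the temporary demo files, and the UTF-8 encoding are choices made for the example and are not part of the original answer.

```python
import bz2
import itertools
import os
import tempfile


def copy_first_lines(src_path, dst_path, n):
    """Write the first n lines of a bz2 file to a UTF-8 text file."""
    # "rt" decodes bytes to str on the fly; islice stops after n lines,
    # so the rest of the archive is never decompressed.
    with bz2.open(src_path, "rt", encoding="utf-8") as source_file, \
            open(dst_path, "w", encoding="utf-8", newline="") as output_file:
        output_file.writelines(itertools.islice(source_file, n))


# Small demo: a 9-line test archive, then copy its first 3 lines.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "demo.bz2")
dst = os.path.join(tmp_dir, "demo.txt")
with bz2.open(src, "wt", encoding="utf-8") as fp:
    fp.write("\n".join("123456789"))

copy_first_lines(src, dst, 3)
with open(dst, encoding="utf-8") as fp:
    result = fp.read()
print(repr(result))
```

For the case in the question you would call `copy_first_lines("file.bz2", "file.txt", 10000)`. If the archive has fewer lines than requested, `islice` simply stops at the end of the file.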

Goodies · Answer 2 · 2016-05-11T21:29:15.953

7

This is definitely a simpler way of doing it than the other answer, and it is an easy way to do so in both Python 2 and 3. Also, the loop simply ends early if the file has fewer than 10,000 lines.

from bz2 import BZ2File as bzopen

# writing to a file
with bzopen("file.bz2", "w") as bzfout:
    for i in range(123456):
        bzfout.write(b"%i\n" % i)

# reading a bz2 archive
with bzopen("file.bz2", "r") as bzfin:
    # handle lines here
    lines = []
    for i, line in enumerate(bzfin):
        if i == 10000:
            break
        lines.append(line.rstrip())

print(lines)

edited May 11 '16 at 21:29

answered May 11 '16 at 21:06

Goodies

Since python 3 also has `BZ2File` I see no need to do the dual imports. – tdelaney May 11 '16 at 21:08
The `open` is wrapped with an `io.TextIOWrapper` object so you have more flexibility with encodings, etc... I prefer it to BZ2 file to begin with. – Goodies May 11 '16 at 21:11
_"This is definitely a simpler way of doing it than the other answer,"_ are you referring to my answer? How is this simpler? It pulls all 10000 lines into memory and doesn't write an output file at all. – tdelaney May 11 '16 at 21:12
You don't really have more flexibility if you also want to maintain python 2 and 3 compatibility. It just seems more complicated to me! – tdelaney May 11 '16 at 21:14
@tdelaney you can easily write the file to another. That wasn't the point. The point was to show an easy, cross-version way of opening an archive... I don't expect him to read 10,000 lines into memory. The only reason it stays is the `lines` list. If he doesn't have that, it doesn't read it into memory. – Goodies May 11 '16 at 21:28
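
The two points debated in these comments, decoding through `io.TextIOWrapper` and not holding all the lines in memory, can be combined in a small generator. This is a hedged sketch for Python 3 only, where `BZ2File` is an `io`-style object that `TextIOWrapper` can wrap. The name `iter_first_lines` and the demo file are hypothetical and do not come from the thread.

```python
import bz2
import io
import os
import tempfile


def iter_first_lines(path, n, encoding="utf-8"):
    """Yield the first n decoded lines of a bz2 file, one at a time."""
    # TextIOWrapper turns the binary BZ2File stream into text; closing the
    # wrapper (via the with block) also closes the underlying BZ2File.
    with io.TextIOWrapper(bz2.BZ2File(path, "rb"), encoding=encoding) as text:
        for i, line in enumerate(text):
            if i == n:
                break
            yield line.rstrip("\n")


# Demo: write 20 numbered lines, then read back only the first 5.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bz2")
with bz2.BZ2File(demo_path, "wb") as bzfout:
    bzfout.write("".join("%d\n" % i for i in range(20)).encode("ascii"))

first = list(iter_first_lines(demo_path, 5))
everything = list(iter_first_lines(demo_path, 100))
print(first)
```

Because it is a generator, a caller can write each line to another file as it arrives instead of collecting them in a list; asking for more lines than the file contains just ends the iteration early.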

rfportilla · Answer 3 · 2020-04-01T21:02:23.860

2

Just another variation.

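import bz2

myfile = 'c:\\my_dir\\random.txt.bz2'
newfile = 'c:\\my_dir\\random_10000.txt'

# BZ2File yields bytes, so the output file is opened in binary mode
with bz2.BZ2File(myfile) as stream, open(newfile, 'wb') as f:
  for i in range(10000):  # range(1, 10000) would copy only 9999 lines
    f.write(stream.readline())  # readline() returns an empty string at EOF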

edited Apr 01 '20 at 21:02

answered May 11 '16 at 21:29

rfportilla

score 0 · Answer 4 · answered Dec 07 '16 at 04:29

This worked for me:

sudo apt-get install python-dev
sudo pip install backports.lzma

answered Dec 07 '16 at 04:29

jmunsch
