4

I am trying to use the Freebase data dump, but it seems I have problems reading the files with Python. It looks like my program can't read all the lines.

def test2():
    count=0
    for line in open(FREEBASE_TOPIC):
        count+=1
    return count

def test3():
    count=0
    for line in open(FREEBASE_QUAD):
        count+=1
    return count


if __name__ == "__main__":

   print "FREEBASE TOPIC - NR LINES:",test2()
   print "FREEBASE QUAD - NR LINES:",test3()

Results in this:

FREEBASE TOPIC - ITR TIME: 1.21000003815
FREEBASE TOPIC - NR LINES: 1643010

FREEBASE QUAD - ITER TIME: 0.797000169754
FREEBASE QUAD - NR LINES: 3155131

This can't be all. That looks like too few lines to contain the whole of Freebase, and I can't see how it is possible to iterate over one 33GB file and another 5GB file in 2 seconds.

What is wrong? I am downloading the files again in case something went wrong during the download, but that takes ages with my connection, so I am asking here in the meantime. The file size is correct, and I have printed some of the lines and they look correct.

kimg85

3 Answers

2

This is a problem I have run into myself:

open('file', 'rb')

should solve it.

chr(26)

sometimes causes a premature end of file in text mode 'r', which is the default.
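
A minimal sketch of the question's line counter rewritten to use binary mode. The function name `count_lines` and its `path` parameter are made up for illustration; only the `'rb'` mode comes from the answer above.

```python
def count_lines(path):
    """Count the lines in a file while reading it as raw bytes."""
    count = 0
    # 'rb' turns off text-mode handling, so a stray chr(26) byte (0x1A)
    # is read as ordinary data instead of ending the read early.
    with open(path, 'rb') as handle:
        for _ in handle:
            count += 1
    return count
```

Called as `count_lines(FREEBASE_QUAD)`, this should keep reading past any 0x1A byte until the real end of the file.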

User
  • open('file', 'rb') instead of open('file') worked! codecs.open('file', "r", "utf-8") also works, but it produces more lines than there really are, because some of the Unicode characters in the file are treated as line breaks, which is bad for TSV files. – kimg85 Jun 05 '12 at 06:58
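
One possible way around the extra line breaks described in the comment above is to split lines in binary mode and decode each line afterwards. This is only a sketch: `iter_tsv_rows` is a hypothetical helper name, and it assumes the dump is UTF-8 with one record per `\n`-terminated line.

```python
def iter_tsv_rows(path):
    """Yield each row of a UTF-8 TSV file as a list of text fields."""
    with open(path, 'rb') as handle:
        for raw in handle:
            # Iterating a binary file splits on the newline byte only, so
            # Unicode separators such as U+2028 stay inside their field.
            text = raw.rstrip(b'\r\n').decode('utf-8')
            yield text.split(u'\t')
```

Because no multi-byte UTF-8 sequence contains the newline byte, splitting before decoding cannot cut a character in half.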
2

It sounds like you are decompressing the files before using them. You're almost certainly better off keeping the file compressed and decompressing it as you access it.

from bz2 import BZ2File
for line in BZ2File('freebase-datadump-quadruples-<date>.tsv.bz2','rU'):
    pass  # process the line here
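
As a rough illustration of the same idea applied to the line count from the question, the sketch below counts lines straight from the compressed file. The name `count_compressed_lines` is hypothetical, and plain `'r'` mode is used instead of `'rU'` on the assumption that newer Python versions may reject the `U` flag.

```python
from bz2 import BZ2File


def count_compressed_lines(path):
    """Count the lines in a .bz2 file without unpacking it to disk."""
    count = 0
    # BZ2File decompresses in chunks while iterating, so only a small
    # part of the data is held in memory at any time.
    with BZ2File(path, 'r') as handle:
        for _ in handle:
            count += 1
    return count
```
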
Tom Morris
0

Your script runs fine and produces the correct numbers of lines for me on Ubuntu. Could this be a limitation of your OS?

Parsing large (20GB) text file with python - reading in 2 lines as 1

Shawn Simister