
I am trying to read a very simple but somewhat large (800 MB) csv file using the csv library in Python. The delimiter is a single tab and each line consists of some numbers. Each line is a record, and I have 20681 rows in my file. I had some problems during my calculations using this file; it always stops at a certain row. I got suspicious about the number of rows in the file, so I used the code below to count them:

tfdf_Reader = csv.reader(open('v2-host_tfdf_en.txt'),delimiter=' ')
c = 0
for row in tfdf_Reader:
  c = c + 1
print c

To my surprise, c is printed with the value of 61722! Why is this happening? What am I doing wrong?

Hossein
  • hello hello ... if you have solved your problem, accept an answer or write your own answer and accept it -- otherwise you need to give more information so that we can help you. – John Machin Jun 18 '10 at 00:40
  • hi, sorry for the belated reply. The problem was that it was saved in the Unix format, so I had no choice but to install Ubuntu and throw Windows away. Now everything is fine. I used vim and saw that long rows had caused this problem. Btw, thanks for the debug code, it helped me a lot. I chose it as the answer so anyone who has the same problem can use it. – Hossein Jun 18 '10 at 15:04
  • "The problem was that it was saved in the Unix format. So I had no choice to install Ubuntu and throw windows away." -- "saved in the Unix format" should not be a problem with Python; in future, consider describing your problem and seeking more help before taking such drastic action. – John Machin Jun 18 '10 at 22:12
  • this helped me: http://stackoverflow.com/questions/5973363/bulkloader-csv-size-error – shigeta Jan 17 '13 at 20:12

2 Answers


800 million bytes in the file and 20681 rows means that the average row size is over 38 THOUSAND bytes. Are you sure? How many numbers do you expect in each line? How do you know that you have 20681 rows? That the file is 800 Mb?

61722 rows is almost exactly 3 times 20681 -- is the number 3 of any significance e.g. 3 logical sub-sections of each record?

To find out what you really have in your file, don't rely on what it looks like. Python's repr() function is your friend.
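For instance (a minimal illustration, with a made-up stand-in line rather than data from the actual file), repr() makes the invisible characters visible:

```python
# repr() shows tabs, carriage returns and newlines explicitly,
# instead of rendering them as invisible whitespace.
line = "1\t2\t3\r\n"   # stand-in for one line read from the file
print(repr(line))      # tabs appear as \t, the line ending as \r\n
```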

Are you on Windows? Even if not, always open(filename, 'rb').

If the fields are tab-separated, then don't put delimiter=" " (whatever is between the quotes appears not to be a tab). Put delimiter="\t".

Try putting some debug statements in your code, like this:

DEBUG = True
f = open('v2-host_tfdf_en.txt', 'rb')
if DEBUG:
    rawdata = f.read(200)
    f.seek(0)
    print 'rawdata', repr(rawdata)
    # what is the delimiter between fields? between rows?
tfdf_Reader = csv.reader(f, delimiter='\t')
c = 0
for row in tfdf_Reader:
    c = c + 1
    if DEBUG and c <= 10:
        print "row", c, repr(row)
        # Are you getting rows like you expect?
print "rowcount", c

Note: if you are getting Error: field larger than field limit (131072), that means your file has 128Kb of data with no delimiters.

I'd suspect that:

(a) your file has random junk or a big chunk of binary zeroes appended to it -- this should be obvious in a hex editor; it also should be obvious in a TEXT editor. Print all the rows that you do get to help identify where the trouble starts.

or (b) the delimiter is a string of one or more whitespace characters (space, tab), the first few rows have tabs, and the remaining rows have spaces. If so, this should be obvious in a hex editor (or in Notepad++, especially if you do View/Show Symbol/Show all characters). If this is the case, you can't use csv, you'd need something simple like:

f = open('v2-host_tfdf_en.txt', 'r') # NOT 'rb'
rows = [line.split() for line in f]
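The reason split() with no argument works for case (b): it treats any run of whitespace (spaces and/or tabs) as a single delimiter, so mixed rows come out uniformly. A quick check:

```python
# split() with no argument collapses any run of whitespace,
# so tab-separated and space-separated lines split identically.
print("1\t2\t3".split())   # ['1', '2', '3']
print("1  2 3".split())    # ['1', '2', '3']
```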
John Machin

My first guess would be the delimiter. How are you ensuring the delimiter is a tab? What is actually the value you are passing? (The code you pasted lists a space, but I'm sure you intended to pass something else.)

If your file is tab-separated, then look specifically for '\t' as your delimiter. Looking for a space would mess up situations where there is a space in your data that is not a column separator.

Also, if your file is an excel-tab, then there is a special "dialect" for that.
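That dialect is registered under the name 'excel-tab'; a sketch of what the reader call would look like (using an in-memory string here instead of the actual file):

```python
import csv
import io

# 'excel-tab' is csv's built-in tab-delimited dialect, equivalent
# to passing delimiter='\t' by hand.
data = io.StringIO("1\t2\t3\n4\t5\t6\n")
for row in csv.reader(data, dialect='excel-tab'):
    print(row)   # ['1', '2', '3'] then ['4', '5', '6']
```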

Uri
  • I am actually saying it by looking at the data. I see there is a space between my values. – Hossein Jun 16 '10 at 21:41
  • How are you looking at your data? with a Hex editor or just via a text editor? A text editor might not correctly present the tabs. – Uri Jun 16 '10 at 21:43
  • With Notepad++. I used '\t' but after several lines it gives me this error: Traceback (most recent call last): File "C:\Users\Hossein\Documents\UvA Study Materials\ECML\Codes\TFIDFMaker\TFIDFgenerator.py", line 43, in process_tf_idf() File "C:\Users\Hossein\Documents\UvA Study Materials\ECML\Codes\TFIDFMaker\TFIDFgenerator.py", line 21, in process_tf_idf for row in tfdf_Reader: Error: field larger than field limit (131072) – Hossein Jun 16 '10 at 21:45