800 million bytes in the file and 20681 rows means that the average row size is over 38 THOUSAND bytes. Are you sure? How many numbers do you expect in each line? How do you know that you have 20681 rows? That the file is 800 MB?
61722 rows is almost exactly 3 times 20681 -- is the number 3 of any significance, e.g. 3 logical sub-sections of each record?
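If you are not 100% sure of those numbers, it only takes a few lines to check them directly (using the same filename that appears in your code; adjust if yours differs):

import os

filename = 'v2-host_tfdf_en.txt'
print 'file size in bytes:', os.path.getsize(filename)
f = open(filename, 'rb')
print 'number of lines:', sum(1 for line in f)
f.close()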
To find out what you really have in your file, don't rely on what it looks like. Python's repr() function is your friend.
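For example (illustrative value only):

s = 'aaa\tbbb ccc\r\n'
print s          # the tab looks just like a run of spaces on the screen
print repr(s)    # -> 'aaa\tbbb ccc\r\n' -- the tab and the line ending are explicit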
Are you on Windows? Even if not, always open(filename, 'rb').
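The reason: in text mode on Windows, '\r\n' is silently translated to '\n' and a stray Ctrl-Z (0x1A) byte is treated as end-of-file, so data can go missing before csv ever sees it. A throwaway demonstration (the file name here is just for illustration; the truncation behaviour applies on Windows):

f = open('demo_ctrlz.txt', 'wb')
f.write('first\r\nsec\x1aond\r\n')
f.close()
print repr(open('demo_ctrlz.txt', 'r').read())   # on Windows: '\r\n' -> '\n' and the read stops at the Ctrl-Z
print repr(open('demo_ctrlz.txt', 'rb').read())  # 'rb' gives you every byte exactly as it is on disk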
If the fields are tab-separated, then don't put delimiter=" " (whatever is between the quotes appears not to be a tab). Put delimiter="\t".
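In other words, assuming the fields really are tab-separated, the call should end up looking like:

import csv
f = open('v2-host_tfdf_en.txt', 'rb')
tfdf_Reader = csv.reader(f, delimiter='\t')   # a single real tab character, not one or more spaces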
Try putting some debug statements in your code, like this:
import csv

DEBUG = True
f = open('v2-host_tfdf_en.txt', 'rb')
if DEBUG:
    rawdata = f.read(200)
    f.seek(0)
    print 'rawdata', repr(rawdata)
    # what is the delimiter between fields? between rows?
tfdf_Reader = csv.reader(f, delimiter=' ')   # your original delimiter; change it once rawdata shows what is really there
c = 0
for row in tfdf_Reader:
    c = c + 1
    if DEBUG and c <= 10:
        print "row", c, repr(row)
        # Are you getting rows like you expect?
print "rowcount", c
Note: if you are getting Error: field larger than field limit (131072), that means your file has 128 KB of data with no delimiters.
I'd suspect that:
(a) your file has random junk or a big chunk of binary zeroes appended to it -- this should be obvious in a hex editor; it also should be obvious in a TEXT editor (a quick way to check the end of the file is shown at the bottom of this answer). Print all the rows that you do get to help identify where the trouble starts.
or (b) the delimiter is a string of one or more whitespace characters (space, tab), the first few rows have tabs, and the remaining rows have spaces. If so, this should be obvious in a hex editor (or in Notepad++, especially if you do View/Show Symbol/Show all characters). If this is the case, you can't use csv; you'd need something simple like:
f = open('v2-host_tfdf_en.txt', 'r')  # NOT 'rb'
rows = [line.split() for line in f]   # split() with no argument splits on any run of whitespace
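For case (a), you don't strictly need a hex editor; the same repr() / 'rb' trick used above will show whatever is sitting at the end of the file:

f = open('v2-host_tfdf_en.txt', 'rb')
f.seek(0, 2)                        # go to the end of the file
f.seek(max(f.tell() - 200, 0))      # back up 200 bytes (or to the start, if the file is tiny)
print 'tail', repr(f.read())        # binary zeroes will show up as '\x00'
f.close()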