
I am working on a file with 156371992 rows, using Python's csv module, but it always loads only the first 34739332 rows. It doesn't throw any error, which I suppose is because the reader believes it has reached the end of the file, when it is far from it. I couldn't find anything in the docs, so I'm adding the code snippet too:

import csv

# csvfile is an already-open file object
has_header = csv.Sniffer().has_header(csvfile.read(1024))
csvfile.seek(0)  # rewind after sniffing the header
reader = csv.reader(csvfile)
if has_header:
    next(reader)  # skip the header row
print("len of reader", len(list(reader)))

This always gives 34739332. Any explanations?

Ajay Tom George
  • Can't reproduce. Check your file closely. Is there a line with a wrong number of commas? Try to debug by printing `list(reader)[-1]` to see what it has as the last line. Also, do you really have a CSV file with 156 million rows in it? If so, maybe it's time to switch to a proper database – DeepSpace Jan 22 '21 at 18:39
  • `has_header = csv.Sniffer().has_header(csvfile.read(1024))` it seems you are limiting by `read(1024)`. Can you try with a bigger number, like `has_header = csv.Sniffer().has_header(csvfile.read(1024*2))`? – Epsi95 Jan 22 '21 at 18:41
  • I think the problem is memory. You should not load all the data at once. Iteration will help. – r.burak Jan 22 '21 at 18:44
  • @r.b.leon might have a point. 34739332 is suspiciously close to `2 ** 25`, so it might be a Python / OS limitation in your environment, although I would have expected a `MemoryError` to be raised if that were the case. – DeepSpace Jan 22 '21 at 18:46
  • If this is Windows, make sure the file doesn't have a Ctrl-Z in it. – Mark Ransom Jan 22 '21 at 18:53
  • @DeepSpace, I had checked the rows near 34739332 manually and there was no error. I also tried loading the data by iterating with a for loop, but the results were similar. I think you're right that it's a memory issue, but I too expected to see a memory error. I'll try on a computer with different specs – Ajay Tom George Jan 22 '21 at 19:04
  • @MarkRansom I am running Ubuntu, though – Ajay Tom George Jan 22 '21 at 19:06
  • @DeepSpace To let you know, there was one corrupt line in the file at that point, as you said. When I checked manually, OpenOffice had removed the corrupted line, so I didn't discover it. The irony is that none of the packages threw an error or skipped to the next line (see the sketch after these comments). – Ajay Tom George Jan 25 '21 at 15:13
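
Following the iteration and "check the last row" suggestions in the comments, here is a minimal sketch of how such a corrupt line could be located without loading everything into memory. The path `data.csv` is a placeholder, and taking the expected field count from the first row is an assumption, not part of the asker's actual setup:

import csv

path = "data.csv"  # placeholder; substitute the real file

with open(path, newline="") as csvfile:
    reader = csv.reader(csvfile)
    expected = len(next(reader))  # field count taken from the first row
    total = 0
    # row numbers count the first row as row 1
    for row_no, row in enumerate(reader, start=2):
        total += 1
        if len(row) != expected:
            print(f"row {row_no}: expected {expected} fields, got {len(row)}")
print("rows counted:", total)

Because this reads one row at a time, it also sidesteps the memory concern raised above while reporting any row whose field count differs from the first row.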

1 Answer


For large datasets, it might be better to import `read_csv` from pandas rather than use the csv library. Try:

from pandas import read_csv

# read_csv accepts a file path or an open file object
dataset = read_csv(csvfile)

This will create a pandas DataFrame. If you need to manipulate it, the pandas library functions should be adequate. If not, you can import NumPy and use `dataset = numpy.array(dataset)`.

If that doesn't work, try importing NumPy and using `numpy.genfromtxt` instead.
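
If the whole file is too large to load comfortably in one go, `read_csv` can also stream it in chunks. A rough sketch, again with a placeholder path:

import pandas as pd

total_rows = 0
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
for chunk in pd.read_csv("data.csv", chunksize=1_000_000):
    total_rows += len(chunk)  # process or aggregate each chunk here
print("rows read:", total_rows)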

Falcon72