
I need to read a CSV with a couple million rows. The file grows throughout the day. After each time I process the file (and zip each row into a dict), I start the process over again, except creating the dict only for the new lines.

In order to get to the new lines though, I have to iterate over each line with CSV reader and compare the line number to my 'last line read' number (as far as I know).

Is there a way to just 'skip' to that line number?

10mjg

2 Answers

You can't jump to a specific line number unless every line has the same size and you know that size. By "can't", I mean you can't do it without loading the whole file into memory and splitting it on the \n character.

If your CSV has fixed-size lines, like this:

id,code,quantity
0001,ABC43,00100
0002,D2ZAD,00020
....

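where each line has the same length, then you can seek directly to byte offset linesize*(linenumber+1), where linesize is the length of one line including its newline character and linenumber is the 0-based index of the data row you want (the +1 skips the header line, which here has the same length as the data lines).
Otherwise, you need to loop through the file to reach the n-th line. There is, however, a standard-library module named linecache that can help: see "Go to a specific line in Python?".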
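
As a rough illustration of the fixed-size case, here is a minimal sketch. The function name `read_fixed_width_rows` and its parameters are hypothetical, and it assumes an ASCII file opened in binary mode whose header and data lines all have exactly the same byte length:

```python
import io


def read_fixed_width_rows(f, line_size, first_row):
    """Yield data rows (as lists of fields) starting at 0-based row `first_row`.

    Assumes `f` is opened in binary mode and that every line, header
    included, is exactly `line_size` bytes long (newline included).
    """
    f.seek(line_size * (first_row + 1))  # +1 skips the header line
    while True:
        chunk = f.read(line_size)
        if len(chunk) < line_size:
            # End of file, or a line that is still being written: stop here.
            break
        yield chunk.decode("ascii").rstrip("\r\n").split(",")


# In-memory stand-in for the sample file above; each line is 17 bytes.
sample = b"id,code,quantity\n0001,ABC43,00100\n0002,D2ZAD,00020\n"
for row in read_fixed_width_rows(io.BytesIO(sample), 17, 1):
    print(row)  # ['0002', 'D2ZAD', '00020']
```

Stopping on a short read means a partially written last line is left for the next pass, which seems appropriate for a file that grows during the day.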

Maxime Lorant
  • Thank you very much for this helpful response. I will try linecache this afternoon. – 10mjg Feb 13 '14 at 19:59
  • I'm a little curious as to how to proceed once I use linecache to get to the specific line. – 10mjg Feb 13 '14 at 22:32
  • I don't really know how `linecache` works internally. You could iterate over the lines by calling `linecache.getline(filename, n)` with `n` starting from `linenumber`, and stop when it returns an empty string (which, according to the docs, means the line doesn't exist). Check the performance, but the docs say that `linecache` manages an internal cache, so it should be fine. – Maxime Lorant Feb 13 '14 at 22:36
  • I'm imagining a use for linecache where I could instruct it to grab all lines from a specific line to the end of the file (or a fixed number of lines, say, 20,000 at a time). If linecache can only grab one line at a time, I think it won't lead to an easy or elegant solution. I am going to continue researching obviously... Thank you... – 10mjg Feb 13 '14 at 22:50
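
The loop described in the comments could be sketched roughly as follows. The helper name `lines_from` is hypothetical. Note that `linecache` numbers lines from 1 and reads the whole file into memory when it first caches it, so this avoids re-parsing earlier rows rather than avoiding the read itself:

```python
import linecache


def lines_from(filename, start):
    """Yield lines from 1-based line number `start` to the end of the file."""
    # Invalidate the cached copy if the file has changed on disk since
    # the last call (the file grows during the day).
    linecache.checkcache(filename)
    n = start
    while True:
        line = linecache.getline(filename, n)
        if not line:
            # getline returns '' when the line number is out of range.
            break
        yield line
        n += 1
```

Because `csv.reader` accepts any iterable of strings, the generator can be passed to it directly, for example `csv.reader(lines_from(path, last_line_read + 1))`, where `path` and `last_line_read` are placeholders for your own values.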

If I were doing this, I think I would add a marker line after each read, before the file is saved again. Then I would read the file in as a string, split on the marker, convert the remainder back to a list of lines, and feed that list to the process.
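
A minimal sketch of the splitting step, assuming a made-up marker string (`MARKER`) that never occurs in the real data and a hypothetical helper name; it still reads the whole file into memory, and it only works if you are able to write the marker into the file between reads:

```python
MARKER = "#---processed---"  # hypothetical marker line; must not occur in real rows


def new_lines_after_marker(text, marker=MARKER):
    """Return the lines that follow the last marker line in `text`.

    If no marker is present yet, every line is treated as new.
    """
    _, _, tail = text.rpartition(marker + "\n")
    return tail.splitlines()


text = "a,1\nb,2\n#---processed---\nc,3\nd,4\n"
print(new_lines_after_marker(text))  # ['c,3', 'd,4']
```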

PyNEwbie