4

Is there a way to find the number of lines in a CSV file without actually loading the whole file into memory (in Python)?

I'd expect there to be some specially optimized function for this. All I can think of is reading the file line by line and counting the lines, but that seems to defeat the purpose, since I only need the number of lines, not the actual content.

sashkello
  • 1
    I would think not. CSV does not store any metadata in a header or footer of the file, so obtaining the number of lines requires reading the entire file. Since this is typically done by reading up to each \r\n or \n (depending on the line-ending convention), there is no faster way to do it. However, it might be quicker to read the file character-wise (not line-wise) and just count the newlines. – Samuel Sep 26 '13 at 06:47
  • Determine colour of cow without looking at it – Hyperboreus Sep 26 '13 at 06:53
  • The duplicate has some very nice answers. I voted to close it... Or should I delete this question? – sashkello Sep 26 '13 at 06:53
  • @Hyperboreus I can determine the color of a cow by looking it up in the farm's cow database, without needing to go to the farm and use a spectrometer. – sashkello Sep 26 '13 at 06:54
  • @sashkello Exactly. Do you have a csv file database of your farm? – Hyperboreus Sep 26 '13 at 06:55
  • @Hyperboreus I'm just refuting your claim. I do not know if there are any alternative tools to speed this up, that's why I'm asking. It is not obvious, that's what I'm saying. – sashkello Sep 26 '13 at 06:56
  • @sashkello: On a different note, why do you want to know? Perhaps you are solving the wrong problem? – Abhijit Sep 26 '13 at 07:04
  • @Abhijit Well, it is not really a crucial question for me, just a matter of convenience: I have scripts regularly processing some huge files, and I'd like to know how many lines they contain so that I know how much is left to process. I now think this could be done by tracking the number of megabytes processed rather than the actual lines... I wouldn't want this feature to become a memory or performance issue. – sashkello Sep 26 '13 at 07:07
  • @sashkello: In that scenario, as you don't need an exact value, I would suggest you get an estimate, for example by reading just a fraction of the lines in the file and calculating the average line size, then simply dividing the size of the file by that average. – Abhijit Sep 26 '13 at 07:11
  • Yes, that's a good idea (as per @fabrizioM answer as well). – sashkello Sep 26 '13 at 07:13
  • @Samuel - Counting newline characters does not work with CSV files, as they can appear inside cell values. More generally, counting the number of lines of a CSV file is not a reliable way to get the number of rows. – mouviciel Sep 26 '13 at 07:16
  • 1
    @mouviciel - Alternatively, count the number of delimiters. Or, if the line count does not need to be exact, estimate it from the file size and the average line width of an n-line sample (uniformly distributed?) from the file. – Samuel Sep 26 '13 at 09:27

2 Answers

10

You don't need to load the whole file into memory, since a file object can be iterated over line by line:

with open(path) as fp:
    count = 0
    for _ in fp:
        count += 1

Or, slightly more idiomatic:

with open(path) as fp:
    count = 0  # stays 0 if the file is empty
    for count, _ in enumerate(fp, 1):
        pass
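
As pointed out in the comments on the question, a quoted CSV field can contain a newline, so the number of physical lines may differ from the number of rows. If rows are what you need, a minimal sketch with the standard csv module could look like the following. The function name count_csv_rows and the has_header parameter are made up for illustration, and the default dialect (comma-separated, double-quoted) is an assumption about the file.

```python
import csv


def count_csv_rows(path, has_header=True):
    """Count CSV records while streaming the file one record at a time."""
    # newline="" lets the csv module handle line endings itself,
    # including newlines embedded in quoted fields.
    with open(path, newline="") as fp:
        reader = csv.reader(fp)
        if has_header:
            next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)
```

This still reads every byte of the file, but only one record is held in memory at a time. Parsing is slower than plain line counting, so it is only worth it when embedded newlines are possible.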
bereal
  • 6
    That reads the whole file into memory (it does not store it, but it does read it) – fabrizioM Sep 26 '13 at 06:50
  • 2
    @fabrizioM "reads", yes, but not "loads" – bereal Sep 26 '13 at 06:51
  • 1
    Well, you still have to read the whole file into memory, you just don't have to keep it in memory completely at once, you can iterate over chunks. – Thilo Sep 26 '13 at 06:51
  • 4
    @aychedee "reads entire file from disk" - yes, "loads the whole file into memory" - no. As far as we can see from the discussion, that's what the OP is asking about. – bereal Sep 26 '13 at 07:23
5

Yes, you need to read the whole file (though not store it all at once) before you can know how many lines are in it. Think of the file as one long string, Aaaaabbbbbbbcccccccc\ndddddd\neeeeee\n: to know how many 'lines' are in the string, you need to count how many \n characters are in it.
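
Counting the \n characters does not require holding the whole string at once. A rough sketch, assuming the file uses \n or \r\n line endings, is to read fixed-size binary chunks and count newline bytes in each. The name count_newlines and the 1 MiB default chunk size are arbitrary choices for illustration.

```python
def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            count += chunk.count(b"\n")
    return count
```

Note that a final line without a trailing newline is not counted by this sketch, and that newlines inside quoted CSV fields are counted like any other.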

If an approximate number is enough, you can read a few lines (~20), work out the average number of characters per line, and then divide the file's size (available from the file's metadata, e.g. via os.path.getsize) by that average to get an estimate.
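
The estimate could be sketched roughly as below. The function name estimate_line_count and its parameters sample_lines and skip_lines are hypothetical. It skips the first line by default, on the assumption that it is a header and not representative of the data, and it measures lengths in bytes so that they are comparable with the file size.

```python
import os
from itertools import islice


def estimate_line_count(path, sample_lines=1000, skip_lines=1):
    """Estimate the line count from the file size and an average line length."""
    total_bytes = os.path.getsize(path)
    with open(path, "rb") as fp:
        # Read the (assumed) header lines separately so they do not skew the average.
        header = list(islice(fp, skip_lines))
        sample = [len(line) for line in islice(fp, sample_lines)]
    if not sample:
        return len(header)  # the file is no longer than the skipped part
    avg_len = sum(sample) / len(sample)
    header_bytes = sum(len(line) for line in header)
    return len(header) + round((total_bytes - header_bytes) / avg_len)
```

The result is only as good as the sample: if line lengths change a lot through the file, the lines at the start may not be representative, and sampling from several offsets would likely give a better estimate.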

fabrizioM
  • 1
    Don't use the first lines: they might contain the header, which is not representative of the actual data. – mouviciel Sep 26 '13 at 07:18
  • This answer is ignoring the many ways to process files larger than memory, e.g. just stream the file one line at a time and count the number of lines. – Abhishek Divekar Jun 16 '20 at 08:49
  • I said read, not store. Even line by line, the file will be moved/read into memory. I will reword to be explicit. – fabrizioM Jun 16 '20 at 10:01