
I have just found out that my script gives me a fatal error:

Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 440 bytes) in C:\process_txt.php on line 109

That line is this:

$lines = count(file($path)) - 1;

So I think it is having difficulty loading the file into memory to count the number of lines. Is there a more efficient way to do this without running into memory issues?

The text files that I need to count the number of lines for range from 2MB to 500MB. Maybe a Gig sometimes.

Thanks all for any help.

Abs

18 Answers


This will use less memory, since it doesn't load the whole file into memory:

$file="largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while(!feof($handle)){
  $line = fgets($handle);
  $linecount++;
}

fclose($handle);

echo $linecount;

fgets loads a single line into memory (if the second argument $length is omitted it will keep reading from the stream until it reaches the end of the line, which is what we want). This is still unlikely to be as quick as using something other than PHP, if you care about wall time as well as memory usage.

The only danger with this is if any lines are particularly long (what if you encounter a 2GB file without line breaks?). In that case you're better off slurping it in in chunks and counting end-of-line characters:

$file="largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while(!feof($handle)){
  $line = fgets($handle, 4096);
  $linecount = $linecount + substr_count($line, PHP_EOL);
}

fclose($handle);

echo $linecount;
Dominic Rodger
  • Thanks for the explanation Dominic - that looks good. I had a feeling it had to be done line by line and not letting count of file load the whole thing into memory! – Abs Jan 29 '10 at 14:38
  • The only danger of this snippet are huge files without linebreaks as fgets will then try to suck up the whole file. It'd be safer to read 4kB chunks at a time and count line termination characters. – David Schmitt Jan 29 '10 at 14:51
  • @David - how does my edit look? I'm not 100% confident about `PHP_EOL` - does that look right? – Dominic Rodger Jan 29 '10 at 14:58
  • not perfect: you could have a unix-style file (`\n`) being parsed on a windows machine (`PHP_EOL == '\r\n'`) – nickf Jan 29 '10 at 15:01
  • @nickf - good point. How would you address it? How does `fgets` work? – Dominic Rodger Jan 29 '10 at 15:23
  • Why not improve a bit by limiting the line reading to 1? Since we only want to count the number of lines, why not do a `fgets($handle, 1);`? – Cyril N. Nov 17 '14 at 15:22
  • @CyrilN. This depends on your setup. If you're having mostly files that contain only some chars per line it could be faster because you don't need to use `substr_count()`, but if you are having very long lines you need to call `while()` and `fgets()` much more, causing a disadvantage. *Do not forget:* `fgets()` does not read line by line. It reads only the amount of chars you defined through `$length` and *if* it contains a linebreak it stops, whatever `$length` has been set to. – mgutt Mar 31 '15 at 04:34
  • @DominicRodger instead of using substr_count() you should use strpos() as `$line` will never include more than one linebreak. Or better use `$last = strlen($line) - 1; if ($line[ $last ] == "\n" || $line[ $last ] == "\r") { $linecount++; }`. This should be the fastest option. – mgutt Mar 31 '15 at 04:42
  • Won't this return 1 more than the number of lines? `while(!feof())` will cause you to read an extra line, because the EOF indicator isn't set until after you try to read at the end of file. – Barmar Apr 29 '15 at 12:10
  • @DominicRodger in the first example I believe `$line = fgets($handle);` could just be `fgets($handle);` because `$line` is never used. – Pocketsand Oct 23 '16 at 15:02
  • For the first solution: It counts an extra line because the loop runs once more than is necessary. To fix that, you need to move the `fgets` call to the end of the loop and clone it once above the loop as well. – ab3000 Apr 13 '17 at 07:53
  • Second function will return wrong count if last line contains some text, but no eol. – eyedmax Feb 20 '22 at 11:47
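Pulling the comment threads together, a minimal sketch (mine, not from the answer) of the off-by-one fix: count only successful reads, since `feof()` isn't set until a read actually fails at end of file.

```php
<?php
// Sketch of the fix discussed in the comments above: only count
// successful reads. feof() is not set until a read fails, so the
// bare while(!feof()) loop counts one line too many.
function countLines(string $path): int
{
    $handle = fopen($path, 'r');
    $count = 0;
    while (fgets($handle) !== false) {
        $count++;
    }
    fclose($handle);
    return $count;
}
```

This also behaves sensibly when the last line has no trailing newline, since that final partial line is still one successful read.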

Using a loop of fgets() calls is a fine solution and the most straightforward to write, however:

  1. even though internally the file is read using a buffer of 8192 bytes, your code still has to call that function for each line.

  2. it's technically possible that a single line may be bigger than the available memory if you're reading a binary file.

This code reads a file in chunks of 8kB each and then counts the number of newlines within that chunk.

function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0;

    while (!feof($f)) {
        $lines += substr_count(fread($f, 8192), "\n");
    }

    fclose($f);

    return $lines;
}

If the average length of each line is at most 4kB, you will already start saving on function calls, and those can add up when you process big files.

Benchmark

I ran a test with a 1GB file; here are the results:

+------------+-------------+------------------+---------+
|            | This answer | Dominic's answer | wc -l   |
+------------+-------------+------------------+---------+
| Lines      | 3550388     | 3550389          | 3550388 |
+------------+-------------+------------------+---------+
| Runtime    | 1.055       | 4.297            | 0.587   |
+------------+-------------+------------------+---------+

Runtime is measured in seconds of real (wall-clock) time.

True line count

While the above works well and returns the same results as wc -l, if the file ends without a newline, the line number will be off by one; if you care about this particular scenario, you can make it more accurate by using this logic:


function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0; $buffer = '';

    while (!feof($f)) {
        $buffer = fread($f, 8192);
        $lines += substr_count($buffer, "\n");
    }

    fclose($f);

    if (strlen($buffer) > 0 && $buffer[-1] != "\n") {
        ++$lines;
    }
    return $lines;
}

Ja͢ck
  • Curious how faster (?) it will be if you extend the buffer size to something like 64k. PS: if only php had some *easy* way to make IO asynchronous in this case – zerkms Dec 12 '13 at 21:51
  • @zerkms To answer your question, with 64kB buffers it becomes 0.2 seconds faster on 1GB :) – Ja͢ck Dec 13 '13 at 03:19
  • Interesting. What about skipping empty lines? – psobko May 14 '14 at 23:07
  • Be careful with this benchmark, which did you run first? The second one will have the benefit of the file already being in disk cache, massively skewing the result. – Oliver Charlesworth Aug 28 '14 at 21:20
  • @OliCharlesworth they're averages over five runs, skipping the first run :) – Ja͢ck Aug 28 '14 at 23:32
  • This answer is great! However, IMO, it must test when there is some character in the last line to add 1 in the line count: https://pastebin.com/yLwZqPR2 – caligari Jan 23 '18 at 17:22
  • Function will return wrong count if last line contains some text, but no eol. – eyedmax Feb 20 '22 at 11:47
  • @eyedmax Surprisingly (or maybe not so) `wc -l` outputs the same number of lines in that condition (i tested with `echo -n "hello world" > file.txt` and both return 0) – Ja͢ck Feb 27 '22 at 05:28

A simple object-oriented solution

$file = new \SplFileObject('file.extension');

while($file->valid()) $file->fgets();

var_dump($file->key());

Update

Another way is to seek to PHP_INT_MAX with the SplFileObject::seek method.

$file = new \SplFileObject('file.extension', 'r');
$file->seek(PHP_INT_MAX);

echo $file->key(); 
Wallace Vizerra

If you're running this on a Linux/Unix host, the easiest solution would be to use exec() or similar to run the command wc -l $path. Just make sure you've sanitized $path first to be sure that it isn't something like "/path/to/file ; rm -rf /".
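For instance, `escapeshellarg()` handles that sanitization; a small sketch using the hostile path from above:

```php
<?php
// escapeshellarg() wraps the argument in single quotes so shell
// metacharacters are passed to wc as a literal (nonexistent) filename
// instead of being executed.
$path = '/path/to/file ; rm -rf /';
$cmd  = 'wc -l ' . escapeshellarg($path);
echo $cmd; // wc -l '/path/to/file ; rm -rf /'
```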

Dave Sherohman
  • I am on a windows machine! If I was, I think that would be the best solution! – Abs Jan 29 '10 at 14:39
  • @ghostdog74: Why, yes, you're right. It is non-portable. That's why I explicitly acknowledged my suggestion's non-portability by prefacing it with the clause "If you're running this on a Linux/Unix host...". – Dave Sherohman Jan 30 '10 at 10:11
  • Non portable (though useful in some situations), but exec (or shell_exec or system) are a system call, which are considerably slower compared to PHP built-in functions. – Manz Nov 08 '12 at 02:15
  • @Manz: Why, yes, you're right. It is non-portable. That's why I explicitly acknowledged my suggestion's non-portability by prefacing it with the clause "If you're running this on a Linux/Unix host...". – Dave Sherohman Nov 12 '12 at 12:00
  • @DaveSherohman Yes, you're right, sorry. IMHO, I think the most important issue is the time consuming in a system call (especially if you need to use frequently) – Manz Nov 13 '12 at 19:42
  • @Manz it is still 8 times faster (or more) on big files (see Jack's answer). – Dejan Marjanović Dec 12 '13 at 07:28
  • This does not work with CSVs created with Excel on MacBooks. They only have carriage returns, and no newline, for line terminators. – Parris Varney Mar 10 '14 at 21:09

There is a faster way I found that does not require looping through the entire file in PHP. It only works on *nix systems, though there might be a similar way on Windows:

$file = '/path/to/your.file';

//Get number of lines
$totalLines = intval(exec("wc -l '$file'"));
Andy Braham

If you're using PHP 5.5 you can use a generator. This will NOT work in any version of PHP before 5.5 though. From php.net:

"Generators provide an easy way to implement simple iterators without the overhead or complexity of implementing a class that implements the Iterator interface."

// This function implements a generator to load individual lines of a large file
function getLines($file) {
    $f = fopen($file, 'r');

    // read each line of the file without loading the whole file to memory
    while (($line = fgets($f)) !== false) {
        yield $line;
    }

    fclose($f);
}

// Since generators implement simple iterators, I can quickly count the number
// of lines using the iterator_count() function.
$file = '/path/to/file.txt';
$lineCount = iterator_count(getLines($file)); // the number of lines in the file
Ben Harold
  • The `try`/`finally` is not strictly necessary, PHP will automatically close the file for you. You should probably also mention that the actual counting can be done using `iterator_count(getFiles($file))` :) – NikiC Oct 13 '13 at 09:34

If you're on Linux you can simply do:

$number_of_lines = intval(trim(shell_exec("wc -l ".$file_name." | awk '{print $1}'")));

You just have to find the right command if you're using another OS.
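A hedged sketch of picking the command per OS — the Windows branch assumes `find /c /v ""` reports the count after the final colon, which is worth verifying on your system:

```php
<?php
// Sketch: choose a line-counting shell command per OS. The Windows
// output parsing is an assumption (output like `---------- FILE.TXT: 123`).
function countLinesViaShell(string $file): int
{
    if (stripos(PHP_OS, 'WIN') === 0) {
        $out = shell_exec('find /c /v "" ' . escapeshellarg($file));
        return (int) trim(substr($out, strrpos($out, ':') + 1));
    }
    // `wc -l < file` prints only the number, with no filename to strip.
    return (int) shell_exec('wc -l < ' . escapeshellarg($file));
}
```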

Regards

elkolotfi

This is an addition to Wallace Vizerra's solution.

It also skips empty lines while counting:

function getLines($file)
{
    $file = new \SplFileObject($file, 'r');
    $file->setFlags(
        SplFileObject::READ_AHEAD |
        SplFileObject::SKIP_EMPTY |
        SplFileObject::DROP_NEW_LINE
    );
    $file->seek(PHP_INT_MAX);

    return $file->key() + 1; 
}
Jani

Based on Dominic Rodger's solution, here is what I use (it uses wc if available, otherwise falls back to Dominic Rodger's solution).

class FileTool
{

    public static function getNbLines($file)
    {
        $linecount = 0;

        $m = exec('which wc');
        if ('' !== $m) {
            $cmd = 'wc -l < "' . str_replace('"', '\\"', $file) . '"';
            $n = exec($cmd);
            return (int)$n + 1;
        }


        $handle = fopen($file, "r");
        while (!feof($handle)) {
            $line = fgets($handle);
            $linecount++;
        }
        fclose($handle);
        return $linecount;
    }
}

https://github.com/lingtalfi/Bat/blob/master/FileTool.php

ling

The most succinct cross-platform solution that only buffers one line at a time.

$file = new \SplFileObject(__FILE__);
$file->setFlags($file::READ_AHEAD);
$lines = iterator_count($file);

Unfortunately, we have to set the READ_AHEAD flag, otherwise iterator_count blocks indefinitely; without that requirement, this would be a one-liner.

Quolonel Questions
private static function lineCount($file) {
    $linecount = 0;
    $handle = fopen($file, "r");
    while (!feof($handle)) {
        if (fgets($handle) !== false) {
            $linecount++;
        }
    }
    fclose($handle);
    return $linecount;
}

I wanted to add a little fix to the function above...

In a specific example where I had a file containing the word 'testing', the function returned 2 as a result, so I needed to add a check for whether fgets returned false or not :)

have fun :)

ufk

Counting the number of lines can be done with the following code:

<?php
$fp = fopen("myfile.txt", "r");
$count = 0;
while ($line = fgetss($fp)) { // fgetss() gets a line from a file, ignoring HTML tags (deprecated in PHP 7.3, removed in PHP 8)
    $count++;
}
echo "Total number of lines are " . $count;
fclose($fp);
?>
Sterling Archer

You have several options. The first is to increase the available memory limit, which is probably not the best approach given that you state the file can get very large. The other way is to use fgets to read the file line by line and increment a counter, which should not cause any memory issues at all, since only the current line is in memory at any one time.

Yacoby

There is another answer that I thought might be a good addition to this list.

If you have perl installed and are able to run things from the shell in PHP:

$lines = exec('perl -pe \'s/\r\n|\n|\r/\n/g\' ' . escapeshellarg('largetextfile.txt') . ' | wc -l');

This should handle most line breaks whether from Unix or Windows created files.

TWO downsides (at least):

1) It is not a great idea to have your script so dependent upon the system it's running on (it may not be safe to assume Perl and wc are available)

2) Just a small mistake in escaping and you have handed over access to a shell on your machine.

As with most things I know (or think I know) about coding, I got this info from somewhere else:

John Reeve Article
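For comparison, a pure-PHP sketch of the same normalization idea (my own, assuming 8 kB chunks are adequate): it counts `\n`, `\r\n`, and bare `\r` endings without shelling out, which side-steps both downsides above:

```php
<?php
// Sketch: count lines ending in \n, \r\n, or bare \r, reading in chunks.
// A trailing \r is carried to the next chunk so a \r\n pair split across
// chunk boundaries is not counted twice.
function countLinesAnyEol(string $file): int
{
    $f = fopen($file, 'rb');
    $lines = 0;
    $carry = '';
    while (!feof($f)) {
        $chunk = $carry . fread($f, 8192);
        $carry = '';
        if ($chunk !== '' && substr($chunk, -1) === "\r") {
            $carry = "\r";              // might be the first half of a \r\n pair
            $chunk = substr($chunk, 0, -1);
        }
        // A \r\n pair would otherwise be counted twice, so subtract it once.
        $lines += substr_count($chunk, "\n")
                + substr_count($chunk, "\r")
                - substr_count($chunk, "\r\n");
    }
    if ($carry === "\r") {              // the file ended with a bare \r
        $lines++;
    }
    fclose($f);
    return $lines;
}
```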

Douglas.Sesar
public function quickAndDirtyLineCounter()
{
    echo "<table>";
    $folders = [
        'C:\wamp\www\qa\abcfolder\\', // the trailing backslash must be escaped in a single-quoted string
    ];
    foreach ($folders as $folder) {
        $files = scandir($folder);
        foreach ($files as $file) {
            if ($file == '.' || $file == '..' || !file_exists($folder.'\\'.$file)) {
                continue;
            }
            $handle = fopen($folder.'/'.$file, "r");
            if (is_bool($handle)) {
                continue; // fopen failed
            }
            $linecount = 0;
            while (!feof($handle)) {
                $line = fgets($handle);
                $linecount++;
            }
            fclose($handle);
            echo "<tr><td>" . $folder . "</td><td>" . $file . "</td><td>" . $linecount . "</td></tr>";
        }
    }
    echo "</table>";
}
  • Please consider adding at least some words explaining to the OP and to further readers of your answer why and how it does reply to the original question. – β.εηοιτ.βε Jul 01 '15 at 21:02

I use this method purely for counting how many lines are in a file. What is the downside of doing this versus the other answers? I'm seeing many lines as opposed to my two-line solution. I'm guessing there's a reason nobody does this.

$lines = count(file('your.file'));
echo $lines;
  • The original solution was this. But since file() loads the entire file in memory this was also the original issue (memory exhaustion), so no, this isn't a solution for the question. – Tuim Oct 26 '17 at 14:43

This is a bit late, but...

Here is my solution for a text log file I have which uses \n to separate each line.

$data = file_get_contents("myfile.txt");
$numlines = strlen($data) - strlen(str_replace("\n","",$data));

It does load the file into memory, but it doesn't need to cycle through an unknown number of lines. It may be unsuitable if the file is gigabytes in size, but for smaller files with short lines of data it works a treat for me.

It just removes the "\n" from the file and counts how many were removed, by comparing the length of the file's data before and after removing all the line breaks ("\n" chars in my case). If your line delimiter is a different character, replace the "\n" with your line delimiter.

I know it is not the best answer for all occasions, but it is something I have found quick and simple for my purposes, where each line of the log is only a few hundred chars and the total log file is not too large.
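The same count can be written in one step with `substr_count()`, which skips building the stripped copy of the string (it still loads the whole file, with the same caveats; the function name is mine):

```php
<?php
// Sketch: count the "\n" occurrences directly instead of comparing
// string lengths before and after stripping them.
function countNewlines(string $path): int
{
    $data = file_get_contents($path);
    return substr_count($data, "\n");
}
```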

David Crawford

For just counting the lines use:

$handle = fopen("file", "r");
$b = 0; // `static` only makes sense inside a function; a plain variable works here
while ($a = fgets($handle)) {
    $b++;
}
fclose($handle);
echo $b;