228

I use huge data files. Sometimes I only need to know the number of lines in these files; usually I open them and read them line by line until I reach the end of the file.

I was wondering if there is a smarter way to do that.

Jon Seigel
  • 12,251
  • 8
  • 58
  • 92
Mark
  • 10,754
  • 20
  • 60
  • 81

19 Answers

250

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150 MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}

EDIT, 9 1/2 years later: I have practically no Java experience, but I have benchmarked this code against the LineNumberReader solution below, since it bothered me that nobody had done it. It seems that my solution is faster, especially for large files, although it takes a few runs until the optimizer does a decent job. I've played with the code a bit and produced a new version that is consistently the fastest:

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        
        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }
        
        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i=0; i<1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        
        // count remaining characters
        while (readChars != -1) {
            for (int i=0; i<readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }
        
        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

Benchmark results for a 1.3 GB text file, y axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers while countLinesNew has none, and although it's only a bit faster, the difference is statistically significant. LineNumberReader is clearly slower.

[Figure: benchmark plot]

martinus
  • 17,736
  • 15
  • 72
  • 92
  • you were right david, I thought the JVM would be good enough for this... I have updated the code, this one is faster. – martinus Jan 17 '09 at 10:01
  • 5
    BufferedInputStream should be doing the buffering for you, so I don't see how using an intermediate byte[] array will make it any faster. You're unlikely to do much better than using readLine() repeatedly anyway (since that will be optimized towards by the API). – wds Jan 17 '09 at 13:23
  • I've benchmarked it with and without the buffered input stream, and it is faster when using it. – martinus Jan 17 '09 at 13:32
  • 56
    You're going to close that InputStream when you're done with it, aren't you? – bendin May 24 '09 at 18:15
  • 5
    If buffering helped it would because BufferedInputStream buffers 8K by default. Increase your byte[] to this size or larger and you can drop the BufferedInputStream. e.g. try 1024*1024 bytes. – Peter Lawrey May 24 '09 at 19:02
  • 1
    Works well until I use it on some Mac-format files or on files in which the last line doesn't have a '\n' character. The number will be incorrect in those situations. Although it is fast, I think I will stick to the "fit-all" readLine() method. – newguy Mar 14 '11 at 06:21
  • 10
    Two things: (1) The definition of a line terminator in Java source is a carriage return, a line feed, or a carriage return followed by a line feed. Your solution won't work for CR used as a line terminator. Granted, the only OS of which I can think that uses CR as the default line terminator is Mac OS prior to Mac OS X. (2) Your solution assumes a character encoding such as US-ASCII or UTF-8. The line count may be inaccurate for encodings such as UTF-16. – Nathan Ryan Sep 21 '12 at 11:58
  • @Nathan_Ryan: I just got logs from java app outputting some mainframe TCP service responses and there were a number of CRs inside. The program using the snippet above gracefully failed. – serg.nechaev Nov 28 '13 at 07:01
  • Nice. I would make this method static and rename it countLines. Cheers – doc Mar 28 '14 at 10:05
  • For what it's worth, I already had the byte[] and used the following: ` private int countLines(byte[] file) throws IOException { InputStream is = new ByteArrayInputStream(file);` – Peter Feb 20 '15 at 13:30
  • 1
    This method shows one line less... Try to look at my answer below. – Ernestas Gruodis Feb 20 '15 at 22:43
  • 1
    It will fail on files which use something else than something which includes `\n` as a line terminator. The count is off by one (one less) for `noeol` files. What actually needs to be counted is not the number of `\n` but the number of occurrences of character sequences separated by line terminator. – Christian Hujer Mar 05 '15 at 13:16
  • 1
    a try with resources is a better way to do this. try(InputStream is = new BufferedInputStream(new FileInputStream(filename))){ //rest of the code as above without the finally block } – user4321 Aug 29 '16 at 18:43
  • 2
    Awesome code... for a 400 MB text file, it took just a second. Thanks a lot @martinus – user3181500 Nov 02 '17 at 12:43
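
The comments above suggest two changes to the answer's code: a try-with-resources block instead of an explicit finally, and a larger byte[] read directly from the FileInputStream instead of going through a BufferedInputStream. A minimal sketch of that combination might look like this. The class name NewlineCounter, the method names, and the 1 MiB buffer size are assumptions for illustration, not measured optima. Like the answer's code, it counts only '\n' bytes, so a final line without a trailing newline is not counted.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class NewlineCounter {
    // Assumed buffer size; benchmark on your own files before settling on a value.
    private static final int BUFFER_SIZE = 1024 * 1024;

    /** Counts '\n' bytes in a file, which is what wc -l reports. */
    static long countNewlines(String filename) throws IOException {
        // try-with-resources closes the stream even if reading fails
        try (InputStream is = new FileInputStream(filename)) {
            return countNewlines(is);
        }
    }

    /** Counts '\n' bytes in any stream; the caller is responsible for closing it. */
    static long countNewlines(InputStream is) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long count = 0;
        int read;
        while ((read = is.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                if (buffer[i] == '\n') {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Returning a long rather than an int also sidesteps overflow on files with more than Integer.MAX_VALUE lines.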
201

I have implemented another solution to the problem, which I found more efficient at counting rows:

try
(
   FileReader       input = new FileReader("input.txt");
   LineNumberReader count = new LineNumberReader(input);
)
{
   while (count.skip(Long.MAX_VALUE) > 0)
   {
      // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
   }

   result = count.getLineNumber() + 1;                                    // +1 because line index starts at 0
}
Nathan
  • 8,093
  • 8
  • 50
  • 76
er.vikas
  • 2,151
  • 1
  • 12
  • 3
  • `LineNumberReader`'s `lineNumber` field is an integer... Won't it just wrap for files longer than Integer.MAX_VALUE? Why bother skipping by a long here? – epb Apr 03 '15 at 20:27
  • 1
    Adding one to the count is actually incorrect. `wc -l` counts the number of newline chars in the file. This works since every line is terminated with a newline, including the final line in a file. Every line has a newline character, including the empty lines, hence the number of newline chars == number of lines in a file. Now, the `lineNumber` variable in `LineNumberReader` also represents the number of newline chars seen. It starts at zero, before any newline has been found, and is increased with every newline char seen. So don't add one to the line number please. – Alexander Torstling Feb 16 '16 at 14:06
  • 1
    @PB_MLT: Although you are right that a file with a single line without newline would be reported as 0 lines, this is how `wc -l` also reports this kind of file. Also see http://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline – Alexander Torstling Feb 16 '16 at 14:10
  • @PB_MLT: You get the opposite problem if the file consists solely of a newline. Your suggested algo would return 0 and `wc -l` would return 1. I concluded that all methods have flaws, and implemented one based on how I would like it to behave, see my other answer here. – Alexander Torstling Feb 16 '16 at 14:50
  • 3
    I've down voted this response, because it seems none of you have benchmarked it – amstegraf Feb 01 '17 at 19:01
30

The accepted answer has an off-by-one error for multi-line files which don't end in a newline. A one-line file ending without a newline would return 1, but a two-line file ending without a newline would return 1 too. Here's an implementation of the accepted solution which fixes this. The endsWithoutNewLine checks are wasteful for everything but the final read, but their cost should be trivial compared to the overall function.

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if(endsWithoutNewLine) {
            ++count;
        } 
        return count;
    } finally {
        is.close();
    }
}
DMulligan
  • 8,993
  • 6
  • 33
  • 34
  • 6
    Good catch. Not sure why you didn't just edit the accepted answer and make a note in a comment though. Most people won't read down this far. – Ryan Dec 11 '13 at 21:33
  • @Ryan , it just didn't feel right to edit a 4 year old accepted answer with 90+ upvotes. – DMulligan Dec 12 '13 at 06:47
  • @AFinkelstein, I feel that is what makes this site so great, that you *can* edit the top voted answer. – Sebastian Jan 27 '14 at 08:48
  • 4
    This solution does not handle carriage return (\r) and carriage return followed by a linefeed (\r\n) – Simon Brandhof Feb 05 '14 at 13:36
  • 1
    @Simon Brandhof, I'm confused on why a carriage return would be counted as another line? A "\n" is a Carriage return line feed, so whoever writes "\r\n" is not understanding something... Plus he is searching char by char, so I'm pretty sure if someone were to use "\r\n" it would still catch the "\n" and count the line. Either way I think he made the point just fine. However, there are many scenarios where this is not a sufficient way to get a line count. – nckbrz Apr 08 '14 at 03:46
26

With Java 8, you can use streams:

try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
  long numOfLines = lines.count();
  ...
}
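
For reference, a self-contained version of the snippet might look like the sketch below. The class name StreamLineCounter and the UTF-8 default are assumptions. Files.lines decodes the file, so bytes that are invalid in the chosen charset can surface as an UncheckedIOException while the stream is consumed.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class StreamLineCounter {
    /** Counts lines using an explicit charset instead of the platform default. */
    static long countLines(Path path, Charset charset) throws IOException {
        // try-with-resources closes the underlying file handle
        try (Stream<String> lines = Files.lines(path, charset)) {
            return lines.count();
        }
    }

    /** Convenience overload; UTF-8 is an assumed default for this sketch. */
    static long countLines(Path path) throws IOException {
        return countLines(path, StandardCharsets.UTF_8);
    }
}
```

A final line without a trailing newline is counted as a line here, so the result can differ by one from a pure newline count.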
4castle
  • 32,613
  • 11
  • 69
  • 106
msayag
  • 8,407
  • 4
  • 30
  • 29
13

The answer with the method count() above gave me line miscounts if a file didn't have a newline at the end of the file - it failed to count the last line in the file.

This method works better for me:

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}
Dave Bergert
  • 176
  • 1
  • 4
  • In this case, there is no need to use LineNumberReader; simply use BufferedReader, and you'll have the flexibility to use a long for `cnt`. – Aqeel Ashiq Jan 30 '14 at 08:02
  • [INFO] PMD Failure:xx:19 Rule:EmptyWhileStmt Priority:3 Avoid empty while statements. – Chhorn Elit Jan 01 '20 at 16:49
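
The BufferedReader variant suggested in the first comment could be sketched as follows. The class name ReaderLineCounter is hypothetical, and the method takes a Reader so that the caller chooses the source and encoding. Incrementing the counter inside the loop also avoids the empty while statement flagged by PMD.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class ReaderLineCounter {
    /** Counts lines as readLine() sees them; closes the given reader when done. */
    static long countLines(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            long count = 0; // long, so very large files cannot overflow the counter
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```

For a file, this could be called as ReaderLineCounter.countLines(new FileReader(filename)).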
11

I tested the above methods for counting lines; here are my observations for the different methods, as tested on my system.

File size: 1.6 GB

Methods:

  1. Using Scanner : 35s approx
  2. Using BufferedReader : 5s approx
  3. Using Java 8 : 5s approx
  4. Using LineNumberReader : 5s approx

Moreover, the Java 8 approach seems quite handy:

Files.lines(Paths.get(filePath), Charset.defaultCharset()).count()
[Return type : long]
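
Timings like these depend on JIT warm-up and on whether the file is already in the operating system's cache, so a single run can mislead. A rough harness for repeating a measurement is sketched below. The class and method names are hypothetical, and this is not a substitute for a dedicated benchmarking tool.

```java
import java.util.concurrent.Callable;

final class LineCountTimer {
    /** Runs the task the given number of times and returns the fastest run in nanoseconds. */
    static long bestOfNanos(Callable<Long> task, int runs) throws Exception {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call(); // result ignored; only the elapsed time matters here
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }
}
```

Each counting method can be wrapped in a Callable<Long> and passed in with the same file and run count, which makes the numbers at least comparable with each other.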
whoami - fakeFaceTrueSoul
  • 17,086
  • 6
  • 32
  • 46
Anshul
  • 415
  • 4
  • 15
8

I know this is an old question, but the accepted solution didn't quite match what I needed it to do. So, I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):

public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}

This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).
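
To illustrate why the encoding matters, here is a small sketch (class and method names are made up for the example) that compares counting raw 0x0A bytes with counting '\n' characters after decoding. In UTF-16BE the single character U+010A is encoded as the bytes 0x01 0x0A, so a byte-based counter sees a newline that is not there.

```java
import java.nio.charset.Charset;

final class EncodingPitfall {
    /** Counts bytes equal to 0x0A, as the byte-based solutions do. */
    static int countNewlineBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == '\n') {
                count++;
            }
        }
        return count;
    }

    /** Decodes the bytes first, then counts line feed characters. */
    static int countNewlineChars(byte[] data, Charset charset) {
        String text = new String(data, charset);
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }
}
```

For ASCII-compatible encodings such as UTF-8 or ISO-8859-1 the two counts agree, because the byte 0x0A never appears inside another character.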

Nathan Ryan
  • 12,893
  • 4
  • 26
  • 37
5

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}

Tested on JDK 8u31. However, its performance is indeed slow compared to this method:

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {

        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }

        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                count++;
            }
        }

        return count;
    }
}

Tested and very fast.

Ernestas Gruodis
  • 8,567
  • 14
  • 55
  • 117
  • This isn't correct. Made some experiments with your code and the method is always slower. `Stream - Time consumed: 122796351 Stream - Num lines: 109808 Method - Time consumed: 12838000 Method - Num lines: 1` And the number of lines is even wrong too – aw-think Feb 27 '15 at 12:59
  • I tested on 32-bit machine. Maybe on 64-bit would be different results.. And it was the difference 10 times or more as I remember. Could you post the text to count line somewhere? You can use Notepad2 to see line breaks for convenience. – Ernestas Gruodis Feb 27 '15 at 13:01
  • That could be the difference. – aw-think Feb 27 '15 at 13:02
  • If you care about performance, you should not use a `BufferedInputStream` when you are going to read into your own buffer anyway. Besides, even if your method might have a slight performance advantage, it loses flexibility, as it doesn’t support sole `\r` line terminators (old MacOS) anymore and doesn’t support every encoding. – Holger Nov 14 '16 at 18:58
4

A straightforward way using Scanner:

static void lineCounter (String path) throws IOException {

        int lineCount = 0, commentsCount = 0;

        Scanner input = new Scanner(new File(path));
        while (input.hasNextLine()) {
            String data = input.nextLine();

            if (data.startsWith("//")) commentsCount++;

            lineCount++;
        }

        System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
    }
Terry Bu
  • 889
  • 1
  • 14
  • 31
3

I concluded that wc -l's method of counting newlines is fine, but it returns non-intuitive results on files where the last line doesn't end with a newline.

And @er.vikas's solution, based on LineNumberReader but adding one to the line count, returns non-intuitive results on files where the last line does end with a newline.

I therefore wrote an algorithm that behaves as follows:

@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}

And it looks like this:

static long countLines(InputStream is) throws IOException {
    try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        //Read will return at least one byte, no need to buffer more
        while((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            //No data read at all, i.e file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                //Ending with newline, deduct one
                return ln;
            }
        }
        //normal case, return line number + 1
        return ln + 1;
    }
}

If you want intuitive results, you may use this. If you just want wc -l compatibility, simply use @er.vikas's solution, but don't add one to the result, and retry the skip:

try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while(lnr.skip(Long.MAX_VALUE) > 0){};
    return lnr.getLineNumber();
}
Alexander Torstling
  • 18,552
  • 7
  • 62
  • 74
2

How about using the Process class from within Java code? And then reading the output of the command.

Process p = Runtime.getRuntime().exec("wc -l " + yourfilename);
p.waitFor();

BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
    System.out.println(line);
    // wc -l prints "<count> <filename>", so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}

Need to try it though. Will post the results.
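
A slightly more defensive sketch of the same idea is shown below. It assumes a wc binary is available on the PATH, and the class and method names are hypothetical. wc -l prints the count followed by the file name, so the output has to be split before parsing, and passing the arguments separately through ProcessBuilder avoids trouble with spaces in file names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

final class WcLineCounter {
    /** Extracts the leading number from output such as "  42 big.log". */
    static long parseWcOutput(String line) {
        String firstToken = line.trim().split("\\s+")[0];
        return Long.parseLong(firstToken);
    }

    /** Runs wc -l on the file; assumes wc is installed and on the PATH. */
    static long countLines(String filename) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("wc", "-l", filename)
                .redirectErrorStream(true) // merge stderr so error text is not lost
                .start();
        String firstLine;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            firstLine = r.readLine();
        }
        int exit = p.waitFor();
        if (exit != 0 || firstLine == null) {
            throw new IOException("wc failed with exit code " + exit + ": " + firstLine);
        }
        return parseWcOutput(firstLine);
    }
}
```

Reading the output before calling waitFor() avoids blocking if the child process fills its output buffer, although for a single line of wc output that is unlikely to matter.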

1

This funny solution actually works really well!

public static int countLines(File input) throws IOException {
    try (InputStream is = new FileInputStream(input)) {
        int count = 1;
        for (int aChar = 0; aChar != -1;aChar = is.read())
            count += aChar == '\n' ? 1 : 0;
        return count;
    }
}
Ilya Gazman
  • 31,250
  • 24
  • 137
  • 216
0

On Unix-based systems, use the wc command on the command-line.

Peter Hilton
  • 17,211
  • 6
  • 50
  • 75
0

The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data giving you the average length of one line, then take the file size and divide it by that average length, but that won't be accurate.

Esko
  • 29,022
  • 11
  • 55
  • 82
  • 1
    Interesting downvote, no matter what command line tool you're using they all DO THE SAME THING anyway, only internally. There's no magic way to figure out the number of lines, they have to be counted by hand. Sure it can be saved as metadata but that's a whole another story... – Esko Jan 17 '09 at 09:27
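
The estimate mentioned in the answer could be sketched like this, assuming the first part of the file is representative of the rest. The class name, method name, and sample size parameter are made up for the example, and the result is only an approximation unless all lines have the same length.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineCountEstimator {
    /** Estimates the line count from the average line length in the first sampleSize bytes. */
    static long estimateLines(Path path, int sampleSize) throws IOException {
        long fileSize = Files.size(path);
        if (fileSize == 0) {
            return 0;
        }
        byte[] sample = new byte[(int) Math.min(sampleSize, fileSize)];
        int filled = 0;
        try (InputStream is = Files.newInputStream(path)) {
            int n;
            // read() may return fewer bytes than requested, so loop until the sample is full
            while (filled < sample.length
                    && (n = is.read(sample, filled, sample.length - filled)) != -1) {
                filled += n;
            }
        }
        long newlines = 0;
        for (int i = 0; i < filled; i++) {
            if (sample[i] == '\n') {
                newlines++;
            }
        }
        if (newlines == 0) {
            return 1; // no terminator in the sample: assume one long line
        }
        double avgLineLength = (double) filled / newlines;
        return Math.round(fileSize / avgLineLength);
    }
}
```

This reads only the sample rather than the whole file, which is the point of the trade-off: speed in exchange for accuracy.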
0

If you don't have any index structures, you won't get around reading the complete file. But you can optimize it by not reading it line by line and instead using a regex to match all line terminators.

David Schmitt
  • 58,259
  • 26
  • 121
  • 165
  • Sounds like a neat idea. Anyone tried it and has a regexp for it? – willcodejavaforfood Jan 17 '09 at 11:02
  • 2
    I doubt it is such a good idea: it will need to read the whole file at once (martinus avoids this) and regexes are overkill (and slower) for such usage (simple search of fixed char(s)). – PhiLho Jan 17 '09 at 11:31
  • @will: what about /\n/ ? @PhiLo: Regex Executors are highly-tuned performance machines. Except the read-everything-into-memory caveat, I don't think that a manual implementation can be faster. – David Schmitt May 17 '11 at 11:37
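
For anyone who wants to try the regex idea, a minimal sketch follows. It assumes the text is already in memory as a CharSequence, which is exactly the caveat raised in the comments, and the class name and pattern are illustrative choices rather than a tuned implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLineCounter {
    // Matches CR LF as one terminator, otherwise a lone CR or LF.
    private static final Pattern LINE_TERMINATOR = Pattern.compile("\\r\\n|[\\r\\n]");

    /** Counts line terminators in text that is already loaded into memory. */
    static int countTerminators(CharSequence text) {
        Matcher m = LINE_TERMINATOR.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Like wc -l, this counts terminators rather than lines, so a last line without a terminator is not included.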
0

Best optimized code for multi-line files that have no newline ('\n') character at EOF.

/**
 * 
 * @param filename
 * @return
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if ( c[i] == '\n' ) {
                    isLine = false;
                    ++count;
                }else if(!isLine && c[i] != '\n' && c[i] != '\r'){   //Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if(isLine){
            ++count;
        }
    }catch(IOException e){
        e.printStackTrace();
    }finally {
        if(is != null){
            is.close();    
        }
        if(fis != null){
            fis.close();    
        }
    }
    LOG.info("count: "+count);
    return (count == 0 && !empty) ? 1 : count;
}
Pramod Yadav
  • 81
  • 1
  • 4
0

Scanner with regex:

public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");  
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }   
    }catch(FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}

Haven't clocked it.

user176692
  • 780
  • 1
  • 6
  • 21
0

It seems that there are a few different approaches you can take with LineNumberReader.

I did this:

int lines = 0;

FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);

String line = count.readLine();

if(count.ready())
{
    while(line != null) {
        lines = count.getLineNumber();
        line = count.readLine();
    }
    
    lines+=1;
}
    
count.close();

System.out.println(lines);

Even more simply, you can use the Java BufferedReader lines() Method to return a stream of the elements, and then use the Stream count() method to count all of the elements. Then simply add one to the output to get the number of rows in the text file.

For example:

FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);

int lines = (int)count.lines().count() + 1;
    
count.close();

System.out.println(lines);
Conor
  • 327
  • 6
  • 15
-2

If you use this

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber(); 
    reader.close();
    return cnt;
}

you can't handle a big number of rows, like 100K rows, because the return value of reader.getLineNumber() is an int. You need a long to handle the maximum number of rows.

thkala
  • 84,049
  • 23
  • 157
  • 201
Faisal
  • 17
  • 15
    An `int` can hold values of up to, approximately, 2 billion. If you are loading a file with more than 2 billion lines, you have an overflow problem. That said, if you are loading an unindexed text file with more than two billion lines, you probably have other problems. – Adam Norberg Jun 02 '11 at 21:26