40

Currently I am using a Scanner with a FileReader and a while (hasNextLine()) loop. I don't think this approach is very efficient. Is there another way to read a file with similar functionality?

public void Read(String file) {
        Scanner sc = null;


        try {
            sc = new Scanner(new FileReader(file));

            while (sc.hasNextLine()) {
                String text = sc.nextLine();
                String[] file_Array = text.split(" ", 3);

                if (file_Array[0].equalsIgnoreCase("case")) {
                    //do something
                } else if (file_Array[0].equalsIgnoreCase("object")) {
                    //do something
                } else if (file_Array[0].equalsIgnoreCase("classes")) {
                    //do something
                } else if (file_Array[0].equalsIgnoreCase("function")) {
                    //do something
                } else if (file_Array[0].equalsIgnoreCase("ignore")) {
                    //do something
                } else if (file_Array[0].equalsIgnoreCase("display")) {
                    //do something
                }
            }

        } catch (FileNotFoundException e) {
            System.out.println("Input file " + file + " not found");
            System.exit(1);
        } finally {
            if (sc != null) {
                sc.close();
            }
        }
    }
BeyondProgrammer
  • This [link](http://www.geeksforgeeks.org/fast-io-in-java-in-competitive-programming/) has some good solutions – Johny Feb 22 '17 at 19:12

9 Answers

44

You will find that BufferedReader.readLine() is as fast as you need: you can read millions of lines a second with it. It is more probable that your string splitting and handling is causing whatever performance problems you are encountering.
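For illustration, here is a minimal sketch of the original loop rewritten around BufferedReader.readLine(), with the first word pulled out via indexOf/substring instead of split() (as the comments below suggest); the class name and the keyword handling are just placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class KeywordReader {

    public void read(String file) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                // extract the first word without the regex overhead of split()
                int space = line.indexOf(' ');
                String keyword = (space == -1) ? line : line.substring(0, space);

                if (keyword.equalsIgnoreCase("case")) {
                    // do something
                } else if (keyword.equalsIgnoreCase("object")) {
                    // do something
                } // ... remaining keywords as in the original code
            }
        }
    }
}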

Nathan Davis
user207421
  • I didn't do a time check, but when I use BufferedReader, the reading part is about 20% faster compared to Scanner – BeyondProgrammer Oct 21 '13 at 06:27
  • In my case, the splitting was the most dominant factor in the file read. Simple use of indexOf/lastIndexOf and substring helped cut those costs to a bare minimum. – lalitm Apr 14 '14 at 06:07
  • For me also the cost got reduced by around 50% once I replaced `split()` with a `substring()`/`indexOf()` pair. – Vikas Prasad Sep 03 '17 at 18:01
25

I made a gist comparing different methods:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Function;

public class Main {

    public static void main(String[] args) {

        String path = "resources/testfile.txt";
        measureTime("BufferedReader.readLine() into LinkedList", Main::bufferReaderToLinkedList, path);
        measureTime("BufferedReader.readLine() into ArrayList", Main::bufferReaderToArrayList, path);
        measureTime("Files.readAllLines()", Main::readAllLines, path);
        measureTime("Scanner.nextLine() into ArrayList", Main::scannerArrayList, path);
        measureTime("Scanner.nextLine() into LinkedList", Main::scannerLinkedList, path);
        measureTime("RandomAccessFile.readLine() into ArrayList", Main::randomAccessFileArrayList, path);
        measureTime("RandomAccessFile.readLine() into LinkedList", Main::randomAccessFileLinkedList, path);
        System.out.println("-----------------------------------------------------------");
    }

    private static void measureTime(String name, Function<String, List<String>> fn, String path) {
        System.out.println("-----------------------------------------------------------");
        System.out.println("run: " + name);
        long startTime = System.nanoTime();
        List<String> l = fn.apply(path);
        long estimatedTime = System.nanoTime() - startTime;
        System.out.println("lines: " + l.size());
        System.out.println("estimatedTime: " + estimatedTime / 1_000_000_000.);
    }

    private static List<String> bufferReaderToLinkedList(String path) {
        return bufferReaderToList(path, new LinkedList<>());
    }

    private static List<String> bufferReaderToArrayList(String path) {
        return bufferReaderToList(path, new ArrayList<>());
    }

    private static List<String> bufferReaderToList(String path, List<String> list) {
        try {
            final BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) {
                list.add(line);
            }
            in.close();
        } catch (final IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    private static List<String> readAllLines(String path) {
        try {
            return Files.readAllLines(Paths.get(path));
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    private static List<String> randomAccessFileLinkedList(String path) {
        return randomAccessFile(path, new LinkedList<>());
    }

    private static List<String> randomAccessFileArrayList(String path) {
        return randomAccessFile(path, new ArrayList<>());
    }

    private static List<String> randomAccessFile(String path, List<String> list) {
        try {
            RandomAccessFile file = new RandomAccessFile(path, "r");
            String str;
            while ((str = file.readLine()) != null) {
                list.add(str);
            }
            file.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    private static List<String> scannerLinkedList(String path) {
        return scanner(path, new LinkedList<>());
    }

    private static List<String> scannerArrayList(String path) {
        return scanner(path, new ArrayList<>());
    }

    private static List<String> scanner(String path, List<String> list) {
        try {
            Scanner scanner = new Scanner(new File(path));
            while (scanner.hasNextLine()) {
                list.add(scanner.nextLine());
            }
            scanner.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return list;
    }


}

run: BufferedReader.readLine() into LinkedList, lines: 1000000, estimatedTime: 0.105118655

run: BufferedReader.readLine() into ArrayList, lines: 1000000, estimatedTime: 0.072696934

run: Files.readAllLines(), lines: 1000000, estimatedTime: 0.087753316

run: Scanner.nextLine() into ArrayList, lines: 1000000, estimatedTime: 0.743121734

run: Scanner.nextLine() into LinkedList, lines: 1000000, estimatedTime: 0.867049885

run: RandomAccessFile.readLine() into ArrayList, lines: 1000000, estimatedTime: 11.413323046

run: RandomAccessFile.readLine() into LinkedList, lines: 1000000, estimatedTime: 11.423862897

BufferedReader is the fastest, Files.readAllLines() is also acceptable, Scanner is slow due to its regex parsing, and RandomAccessFile is unacceptably slow.

YAMM
  • Hey @YAMM, in your gist the `System.out` label "... into ArrayList" is actually using a LinkedList instead of an ArrayList. That means buffered reading into an ArrayList is the fastest. – imAmanRana Feb 22 '21 at 05:30
  • Thanks! I fixed it! I would also suggest (almost) always using ArrayList, since the overall performance is just better. – YAMM Feb 25 '21 at 12:37
9

Scanner can't be as fast as BufferedReader, because it uses regular expressions to parse the input, which slows it down. With BufferedReader you can read a block of text from the file at a time.

BufferedReader bf = new BufferedReader(new FileReader("FileName"));

You can then use readLine() to read line by line from bf.
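For example, a minimal sketch of such a loop, continuing from the snippet above:

String line;
while ((line = bf.readLine()) != null) {
    // handle the line here
    System.out.println(line);
}
bf.close();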

Hope it serves your purpose.

Valdrinium
shamsAAzad
4

You can use FileChannel and ByteBuffer from Java NIO. In my observation, the ByteBuffer size is the most critical factor in reading data faster. The code below reads the content of the file.

import java.io.File;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public static void main(String[] args) throws Exception {
    FileInputStream fileInputStream = new FileInputStream(new File("sample4.txt"));
    FileChannel fileChannel = fileInputStream.getChannel();
    ByteBuffer byteBuffer = ByteBuffer.allocate(1024);

    // keep reading 1024-byte blocks until end of file
    while (fileChannel.read(byteBuffer) != -1) {
        byteBuffer.flip();
        while (byteBuffer.hasRemaining()) {
            System.out.print((char) byteBuffer.get());
        }
        byteBuffer.clear();
    }

    fileChannel.close();
}

You can check for '\n' to detect line breaks here. Thanks.
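Note that casting each byte to char only works for single-byte text such as ASCII; as the comments below point out, the robust approach is to decode the bytes with a CharsetDecoder. A minimal sketch of what the loop body could look like instead, assuming UTF-8 and that no multi-byte character is split across two reads (imports: java.nio.CharBuffer, java.nio.charset.CharsetDecoder, java.nio.charset.StandardCharsets):

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
// inside the read loop, instead of casting bytes to chars:
byteBuffer.flip();
CharBuffer chars = decoder.decode(byteBuffer);   // throws CharacterCodingException
System.out.print(chars);
byteBuffer.clear();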


You can even use a scatter/gather read to fill several buffers in one call, i.e.

fileChannel.read(buffers);

where

      ByteBuffer b1 = ByteBuffer.allocate(B1);
      ByteBuffer b2 = ByteBuffer.allocate(B2);
      ByteBuffer b3 = ByteBuffer.allocate(B3);

      ByteBuffer[] buffers = {b1, b2, b3};

This saves the user process from making several system calls (which can be expensive) and allows the kernel to optimize handling of the data, because it has information about the total transfer. If multiple CPUs are available, it may even be possible to fill and drain several buffers simultaneously.

From this book.
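A minimal self-contained sketch of such a scattering read (the file name and buffer sizes are just placeholders):

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ScatterReadDemo {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("sample4.txt");
             FileChannel channel = in.getChannel()) {

            // one read call fills the buffers in order: b1 first, then b2, then b3
            ByteBuffer b1 = ByteBuffer.allocate(128);
            ByteBuffer b2 = ByteBuffer.allocate(1024);
            ByteBuffer b3 = ByteBuffer.allocate(4096);
            ByteBuffer[] buffers = {b1, b2, b3};

            long bytesRead = channel.read(buffers);
            System.out.println("Read " + bytesRead + " bytes in one call");
        }
    }
}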

Trying
  • A direct byte buffer is of no benefit if the data is being read into the Java side of the JVM. Its benefit comes if you're just copying the data between two channels without looking at it in the Java code. – user207421 Oct 21 '13 at 04:55
  • @EJP I know. I deleted the line here and then your comment came. :-) – Trying Oct 21 '13 at 04:57
  • @Trying, I would like to try using FileChannel; could you provide an example based on my code above? – BeyondProgrammer Oct 21 '13 at 05:06
  • It can't parallel read from a single disk unless it has multiple heads. There is nothing here that actually reads lines at all, so it really isn't an answer to the question at all. – user207421 Oct 21 '13 at 05:10
  • Not only am I reading the file, I am also searching for the words I want, using a delimiter. Does this method work? if (file_Array[0].equalsIgnoreCase("case")) { //do something } – BeyondProgrammer Oct 21 '13 at 05:14
  • @user2822351 you can do this. – Trying Oct 21 '13 at 05:17
  • Your edited code doesn't convert `byte` to `char` correctly. The correct technique is to use a `CharsetDecoder.` – user207421 Oct 21 '13 at 05:21
  • @Trying Why? It's your answer. You're the one who's recommending NIO, so you're the one who is expected to know how to use it. The `CharsetDecoder` hint should be enough if you do. Apparently you don't. My answer is to use `BufferedReader.` – user207421 Oct 21 '13 at 05:29
3

Use BufferedReader for high-performance file access, but note that the default buffer size of 8192 characters is often too small. For huge files you can increase the buffer size by orders of magnitude to boost your file-reading performance. For example:

BufferedReader br = new BufferedReader(new FileReader("file.dat"), 1000 * 8192);
String thisLine;
while ((thisLine = br.readLine()) != null) {
    System.out.println(thisLine);
}
mac7
2

Just updating this thread: now we have Java 8 to do this job:

List<String> lines = Files.readAllLines(Paths.get(file_path));
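If the file is too big to hold comfortably in memory, a lazy, streaming alternative (also Java 8) is a reasonable option; a minimal sketch, reusing the same file_path variable:

// requires: import java.util.stream.Stream; call from a method that throws IOException
try (Stream<String> lines = Files.lines(Paths.get(file_path))) {
    lines.forEach(System.out::println);   // or any other per-line processing
}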
Digao
0

You should first investigate which part of the program is taking the time.

As per EJP's answer, you should use BufferedReader.

If the string processing really is what takes the time, then you should consider using threads: one thread reads from the file and queues lines, while other worker threads dequeue the lines and process them. You will need to experiment with how many threads to use; the number of threads should be related to the number of CPU cores, so that you use the CPU fully.
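A minimal sketch of that producer/consumer split using a BlockingQueue (the queue capacity, thread count, and the poison-pill marker are arbitrary choices for illustration; the file path comes from args[0]):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelLineProcessor {

    private static final String POISON_PILL = "\u0000EOF";   // marks end of input

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // worker threads: dequeue lines and process them
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    String line;
                    while (!(line = queue.take()).equals(POISON_PILL)) {
                        // do the expensive string processing here
                    }
                    queue.put(POISON_PILL);   // pass the marker on so other workers also stop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // reader (here: the main thread) queues lines from the file
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                queue.put(line);
            }
        }
        queue.put(POISON_PILL);
        pool.shutdown();
    }
}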

nullptr
  • If string processing is taking time, then multiple threads doing the same thing will decrease the time, like parallel processing. – nullptr Oct 21 '13 at 05:26
  • This will be usable only when the processing of one line does not depend on the processing of another line. – nullptr Oct 21 '13 at 05:27
  • If string processing is the bottleneck, putting it into a separate thread will only move the bottleneck, not eliminate it. – user207421 Jul 15 '15 at 03:33
  • The bottleneck can be eliminated if the processing is done in multiple threads in parallel. – nullptr Jul 15 '15 at 09:14
  • Concurrency isn't always the solution. The actual problem was either the performance of Scanner, String.split() or equalsIgnoreCase (as it has to deep-compare the strings). – RecursiveExceptionException Jun 13 '16 at 04:47
  • No, the bottleneck can be *distributed* if you process in multiple threads. You can't eliminate processing by distributing it. – user207421 Apr 25 '18 at 21:51
0

You can read the file in chunks if there are millions of records. That will avoid potential memory issues. You need to keep track of the last offset read so you can calculate where the next chunk starts.

// lastOffset, counter and batchSize are assumed to be defined by the surrounding code
try (FileReader reader = new FileReader(filePath);
     BufferedReader bufferedReader = new BufferedReader(reader)) {

    int pageOffset = lastOffset + counter;
    int skipRecords = (pageOffset - 1) * batchSize;

    // skip the earlier pages, then read one batch of lines
    bufferedReader.lines()
            .skip(skipRecords)
            .limit(batchSize)
            .forEach(line -> {
                // process (e.g. print) each line of the current chunk
                System.out.println(line);
            });
} catch (IOException e) {
    e.printStackTrace();
}
arviarya
-2

If you wish to read all lines at once then you should have a look at the Files API of Java 7. It's really simple to use.

But a better approach would be to process this file as a batch: have a reader which reads chunks of lines from the file and a writer which does the required processing or persists the data. Using a batch will ensure that it still works even if the number of lines grows to billions in the future. You can also have a batch that uses multithreading to increase the overall performance. I would recommend that you have a look at Spring Batch.

Pratik Shelar