8

I'm trying to read a large file (approximately 516 MB) that has 18 lines of text. I tried to write the code myself and got an error on the first line of code while trying to read the file:

try (BufferedReader br = new BufferedReader(new FileReader("test.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String fileContent = line;
    }
}

Note: the file exists and its size is approximately 516 MB. If there is another safer and faster method of reading it, please tell me (even if it loses the line breaks).

Edit: here I tried using a Scanner, but it runs a bit longer and then gives the same error:

try(BufferedReader br = new BufferedReader(new FileReader("test.txt"))) {
    Scanner scanner = new Scanner(br);
    while(scanner.hasNext()){
        int index = Integer.parseInt(scanner.next());
        // and here do something with index
    }
}

I even split the file into 1800 lines, but nothing was fixed

user3260312
  • 241
  • 1
  • 4
  • 9
  • 1
    Do you need to load the whole file in memory? – higuaro Apr 15 '15 at 07:59
  • @higuaro yes. I want to sort that file – user3260312 Apr 15 '15 at 08:01
  • @higuaro or is there a way to read that file separately by looping? – user3260312 Apr 15 '15 at 08:03
  • @user3260312 you have a file with `516M` and 18 lines which you want to sort? What type of text do you want to sort? – Uwe Plonus Apr 15 '15 at 08:07
  • @UwePlonus random numbers from 0-100 that are separated by spaces. I already know how to do it, but this OutOfMemoryError is ruining my program – user3260312 Apr 15 '15 at 08:11
  • Actually, with that type of data, you can divide your file into several smaller files, and process each smaller file one by one, just use an array `data[101]` to count the frequency, and you have plenty of space. – Pham Trung Apr 15 '15 at 08:14
  • @higuaro I thought you would write some answer... – user3260312 Apr 15 '15 at 08:14
  • @PhamTrung yeah I've written that code for counting the frequency, but because of an error I can't continue)) – user3260312 Apr 15 '15 at 08:17
  • I wrote an answer, but didn't read about the nature of the data; I thought the lines were totally random strings – higuaro Apr 15 '15 at 08:40
  • possible duplicate of [How do I sort very large files](http://stackoverflow.com/questions/7918060/how-do-i-sort-very-large-files) – Gnoupi Apr 15 '15 at 14:20
  • @Gnoupi WHAT!!! Have you read my question? – user3260312 Apr 15 '15 at 14:22
  • (If it wasn't for your comment about 1800 lines, my best guess was one of the lines being _much_ larger than 516m/18 characters, which _might_ be remedied constructing an integer `Stream` (without using `String` or `char`).) `do something with index` reads a tad unsettling: do you keep anything of `index`? _How_? – greybeard Apr 15 '15 at 19:20

6 Answers

4

Using a BufferedReader already helps you avoid loading the whole file into memory. For further improvement, since you mentioned that each number is separated by a space, instead of this:

line = br.readLine();

we can wrap the reader with a Scanner:

Scanner scanner = new Scanner(br);

and extract each number in the file using scanner.next(); parsing it and storing it as an int will also help reduce memory usage:

int val = Integer.parseInt(scanner.next());

This will help you avoid reading a whole line into memory at once.

You can also limit the buffer size of the BufferedReader:

BufferedReader br = new BufferedReader(new FileReader("test.txt") , 8*1024);

More information: Does the Scanner class load the entire file into memory at once?
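As a rough illustration of the idea (not part of the original answer), and assuming the file really contains integers between 0 and 100 separated by whitespace as the OP describes in the comments, the whole file can be streamed through a Scanner into a small frequency array, so no full line is ever held in memory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public class FrequencyCount {
    public static void main(String[] args) throws IOException {
        int[] freq = new int[101]; // counts for the values 0..100

        try (BufferedReader br = new BufferedReader(new FileReader("test.txt"), 8 * 1024)) {
            Scanner scanner = new Scanner(br);
            while (scanner.hasNextInt()) {
                freq[scanner.nextInt()]++; // one token at a time, never a whole line
            }
        }

        // The sorted output can be reproduced from the counts alone
        for (int value = 0; value <= 100; value++) {
            System.out.println(value + ": " + freq[value]);
        }
    }
}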

Pham Trung
  • 11,204
  • 2
  • 24
  • 43
1

Increase your heap size with -Xmx.

For your file I would suggest a setting of at least -Xmx1536m, as the 516 MB of file data will grow while loading. Internally Java uses 16 bits to represent a character, so a file containing 10 bytes of text will take approximately 20 bytes as a String (except when using UTF-8 with many multi-byte characters).
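For example (the main class name below is just a placeholder, not from the question), the flag is passed on the command line when starting the JVM:

java -Xmx1536m FileSorter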

Uwe Plonus
  • 9,803
  • 4
  • 41
  • 48
  • Will it cause any problems, or will it make the performance of my program slower? – user3260312 Apr 15 '15 at 08:09
  • @user3260312 As long as the computer has enough main memory there should be no problem with increasing the memory size. If you have not enough main memory then you have to search for another solution (independent of your programming language). – Uwe Plonus Apr 15 '15 at 08:11
  • Although not directly related - saying that internally Java uses 16 bits to represent a character is not entirely true. Java uses UTF-16 as a character encoding for Unicode; and not all Unicode characters can be mapped to 16 bit values meaning that there are some characters that require two 16 bit code units. – Random42 Apr 15 '15 at 08:13
  • @m3th0dman it is not correct, I know. But for practical purposes it is enough for a rough assumption to calculate the basic memory consumption... Also surrogate pairs are used seldom... – Uwe Plonus Apr 15 '15 at 08:17
1

EDIT: It is the same for the Java heap space whether you declare the variables inside or outside the loop.

Just a piece of advice.

If you can, you shouldn't declare variables inside loops, because this can fill up the Java heap space. In this example, if it were possible, it would be better:

try (BufferedReader br = new BufferedReader(new FileReader("test.txt"))) {
    String line;
    String fileContent;
    while ((line = br.readLine()) != null) {
        fileContent = line;
    }
}

Why? Because in each iteration Java reserves new space in the heap for the same variable (Java treats it as a new, different variable, which you might want, but probably not), and if the loop is big enough, the heap can fill up.

maiklahoz
  • 135
  • 8
  • Not really, these variables are freed each time the while loop has done one cycle, so the gc will delete them. And the compiler is probably already optimizing this. – RaphMclee Apr 15 '15 at 12:04
  • Ok, thanks @RaphMclee, I supposed that the gc only removes them when the loop is over. Thanks for the information. – maiklahoz Apr 15 '15 at 13:58
1

Java was designed to work with amounts of data bigger than the available memory. At the lower-level API, a file is a stream, possibly endless.

However, with cheap memory people prefer the easy way: read everything into memory and work with it there. Usually it works, but not in your case. Increasing the memory only hides the issue until you have a bigger file. So, it's time to do it right.

I don't know your sorting approach or what you use for comparison. If it is a good one, it may produce a sortable key or index for each string. You read the file once, create a map of such keys, sort them, and then create the sorted file based on this sorted map. In your case that would be (worst-case scenario) 1+18 file readings plus 1 writing.

However, if you don't have such a key and simply compare the strings character by character, then you have to use 2 input streams and compare one to the other. If one string is not in the correct place, you rewrite the file in the correct order and do it again. Worst-case scenario: 18*18 readings to compare, 18*2 readings for writing, and 18 writings.

That's the consequence of an architecture that keeps the data in huge strings in huge files.

Alex
  • 4,457
  • 2
  • 20
  • 59
0

Note: increasing the heap memory limit just to sort a file with 18 lines is a lazy way to solve a programming problem; this philosophy of always increasing the memory instead of solving the real problem is one reason for Java programs' bad reputation for slowness and the like.

My advice, to avoid increasing the memory for such a task, is to split the file by line and merge the lines in a way that resembles a merge sort. This way your program can scale if the file size grows.

To split the file into several "line sub-files", use the read method of the BufferedReader class:

private void splitBigFile() throws IOException {
    // A 10 MB buffer size is decent enough
    final int BUFFER_SIZE = 1024 * 1024 * 10;

    try (BufferedReader br = new BufferedReader(new FileReader("test.txt"))) {
        int fileIndex = 0;
        FileWriter currentSplitFile = new FileWriter(new File("test_split.txt." + fileIndex));

        char[] buffer = new char[BUFFER_SIZE];

        int read;
        while ((read = br.read(buffer)) != -1) {
            // Inspect the buffer in search of the new line character
            int chunkStart = 0;
            for (int i = 0; i < read; i++) {
                if (buffer[i] == '\n') {
                    // This chunk ends a line: write it to the current file,
                    // close that file and start a new split file
                    currentSplitFile.write(buffer, chunkStart, i - chunkStart);
                    currentSplitFile.close();
                    fileIndex++;
                    currentSplitFile = new FileWriter(new File("test_split.txt." + fileIndex));
                    chunkStart = i + 1; // skip the '\n' itself
                }
            }

            // Write whatever remains of the buffer to the current file
            if (chunkStart < read) {
                currentSplitFile.write(buffer, chunkStart, read - chunkStart);
            }
        }
        currentSplitFile.close();
    }
}

To merge them, open all the files and keep a separate buffer (a small one, say 2 MB each) for every one of them, read the first chunk of every file, and you'll have enough information to start rearranging the order of the files. Keep reading chunks when some of the files tie.
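A rough sketch of that chunked comparison (not from the original answer; the method name and buffer size are illustrative, and the usual java.io imports are assumed): two split files can be compared without loading either one fully into memory, letting the BufferedReader do the small per-file buffering:

static int compareSplitFiles(File a, File b) throws IOException {
    final int BUF = 2 * 1024 * 1024; // small per-file buffer, as suggested above
    try (BufferedReader ra = new BufferedReader(new FileReader(a), BUF);
         BufferedReader rb = new BufferedReader(new FileReader(b), BUF)) {
        int ca, cb;
        do {
            ca = ra.read(); // single-char reads are cheap thanks to the buffer
            cb = rb.read();
            if (ca != cb) {
                return Integer.compare(ca, cb); // EOF (-1) makes the shorter file sort first
            }
        } while (ca != -1);
        return 0; // identical contents
    }
}

Sorting the array of split files with a comparator built on this method, and then concatenating them in that order, gives the sorted output without ever holding a complete line in memory.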

higuaro
  • 15,730
  • 4
  • 36
  • 43
  • 2
    "...is a reason of Java programs bad fame about slowness and such" - What you say is true, but is not just limited to Java programs...unfortunately. – Mr Moose Apr 15 '15 at 08:46
  • Even this solution has its limitations, as a file with 516M and only 18 lines is huge, so even the split files have a considerable size... – Uwe Plonus Apr 15 '15 at 09:09
  • It does not matter if the split files are not that small; once the lines are separated they can be arranged using small buffers, without loading any of the files completely into memory, and the solution can scale to more lines. IMHO this is still more memory efficient than increasing the heap to load the whole file – higuaro Apr 15 '15 at 09:19
0

It's hard to guess without understanding the memory profile of your application, your JVM settings, and your hardware. It could be as simple as changing the JVM memory settings, or as hard as going with RandomAccessFile and converting the bytes on your own. I'll try a long shot here: the problem may simply lie in the fact that you are trying to read very long lines, not in the fact that the file is large.

If you look at the implementation of BufferedReader.readLine() you'll see something like this (simplified version):

String readLine() {
  StringBuffer sb = new StringBuffer(defaultStringBufferCapacity);  
  while (true) {
    if (endOfLine) return sb.toString();
     fillInternalBufferAndAdvancePointers(defaultCharBufferCapacity);//(*)
     sb.append(internalBuffer); //(**)
  }
}
// defaultStringBufferCapacity = 80, can't be changed 
// defaultCharBufferCapacity = 8*1024, can be altered

(*) is the most critical line here. It tries to fill an internal buffer of limited size (8 KB) and append that char buffer to the StringBuffer. A 516 MB file with 18 lines means each line will occupy ~28 MB in memory, so it tries to allocate and copy an 8 KB array ~3,500 times per line.

(**) Then it tries to put this array into a StringBuffer with a default capacity of 80. This causes extra allocations for the StringBuffer to ensure its internal buffer is large enough to hold the string: ~25 extra allocations per line, if I'm not mistaken.

So basically, I'd recommend increasing the size of the internal buffer to 1 MB; just pass an extra parameter to the BufferedReader instance, like:

 new BufferedReader(..., 1024*1024);
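For example, a minimal usage sketch of that suggestion with the OP's file name (not part of the original answer):

try (BufferedReader br = new BufferedReader(new FileReader("test.txt"), 1024 * 1024)) {
    String line;
    while ((line = br.readLine()) != null) {
        // each ~28 MB line is still built in memory,
        // but with far fewer intermediate buffer fills and copies
    }
}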
Andrey Taptunov
  • 9,367
  • 5
  • 31
  • 44