
I'm new to Java. In my current project I need to read and write a very large text file (1 GB - 5 GB). First I used these classes: BufferedReader and BufferedWriter:

public static String read(String dir) {
    BufferedReader br;
    String result = "", line;
    try {
        br = new BufferedReader(new InputStreamReader(new FileInputStream(dir), "UTF-8"));
        while ((line = br.readLine()) != null) {
            result += line + "\n";
        }
    } catch (IOException ex) {
        //do something
    }
    return result;
}

public static void write(String dir, String text) {
    BufferedWriter bw;
    try {
        bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dir), "UTF-8"));
        bw.write("");
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '\n') {
                bw.append(text.charAt(i));
            } else {
                bw.newLine();
            }
        }
        bw.flush();
    } catch (IOException ex) {
        //do something
    }
}

These classes work very well, but not for huge files...

Then I used MappedByteBuffer for the read() method (I don't know how to write a file using this class):

public static String read(String dir) {
    FileChannel fc;
    String s = "";
    try {
        fc = new RandomAccessFile(dir, "r").getChannel();
        MappedByteBuffer buffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        buffer.load();
        buffer.force();
        for (int i = 0; i < buffer.limit(); i++) {
            s += (char) buffer.get();
        } //I know the problem is here
        buffer.clear();
        fc.close();
    } catch (IOException e) {
        //do something
    }
    return s;
}

But it still can't read large files (over 30-40 MB); even Notepad is faster than my app :))

Another problem is that I don't know how to change the encoding with the second approach (for example "UTF-8", "ANSI", ...).

So please tell me, what is the best way to read and write really large files? Any ideas?
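
Edit: for the encoding part, would decoding the mapped buffer with a `java.nio.charset.Charset` be the right direction? Just a sketch of what I mean (the method name `readMapped` is mine); it still builds one huge String, so it probably doesn't solve the memory problem:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

public static String readMapped(String dir, String charsetName) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(dir, "r");
         FileChannel fc = raf.getChannel()) {
        // note: a single map() call is limited to Integer.MAX_VALUE bytes
        MappedByteBuffer buffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        // decode() handles multi-byte encodings such as UTF-8, unlike casting
        // each byte to char; the result is still one huge in-memory String
        return Charset.forName(charsetName).decode(buffer).toString();
    }
}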

AvB
  • What do you want to do with all that text? – fge Mar 18 '14 at 14:11
  • 1
    You simply should not read a 1-2GB file into a `String` - this is going to be a) slow and b) memory intensive. You probably need to carry out some transformation of the file so _stream_ it - read and write it one line at a time. More importantly I would recommend [this article](http://kaioa.com/node/59) on using strings in Java. – Boris the Spider Mar 18 '14 at 14:13
  • `try (InputStream in = new BufferedInputStream(new FileInputStream(dir))) { while (in.read() != -1); }` - it doesn't look like you actually do anything with the data you've read, so that should work for you, and very fast. – Erwin Bolwidt Mar 18 '14 at 14:14
  • To show it in a TextArea, do some searches, and save the results in a log file – AvB Mar 18 '14 at 14:15
  • Programs like Notepad will stream the file as you scroll and carry out all sorts of other optimisations. You cannot just dump 2GB of data into a `TextArea`. As far as searching goes, you will need to index the file somehow if it is that big. – Boris the Spider Mar 18 '14 at 14:16
  • Note that I recently started a project allowing to use large text files as `CharSequence`s so as to be able to use them with regexes... But it is alpha quality at the moment. Still, since you can read `CharSequence`s into strings it may be of some help. – fge Mar 18 '14 at 14:16
  • Arrays in Java cannot hold more than 2^31 elements - that's 2Gb. That includes String, StringBuilder, etc. (probably also TextArea). You need specially written components that split the data into multiple arrays if you want to hold more than 2^31 elements of some type. – Erwin Bolwidt Mar 18 '14 at 14:18
  • @fge, is your CharSequence project related to a regex question from a few weeks ago? – aliteralmind Mar 18 '14 at 14:18
  • @aliteralmind yes indeed. I could make it work over a 250 MB text file but it is slow at the moment (since regexes use very, very small subsequences and use .charAt() a lot) – fge Mar 18 '14 at 14:20
  • @fge: Would be interested in seeing it when you're done. – aliteralmind Mar 18 '14 at 14:22
  • What if you don't read line-by-line, and instead use a java.io.RandomAccessFile (for example), moving byte-per-byte and writing to disk while reading? Do you need a line-centric view of the file? By the way, are you just doing a copy&paste of the file? – robermann Mar 18 '14 at 14:30
  • Yes, I want to separate all lines – AvB Mar 18 '14 at 14:38
  • @AvB then read a pool of lines, then another pool etc – fge Mar 18 '14 at 14:39
  • variable-length lines @avb? – robermann Mar 18 '14 at 14:41
  • I would search for carriage returns and store their offset positions starting from the beginning of the file, then show a pool of x lines in the text area. When the user scrolls down, I'd get the next pool (windowing). For this solution you need random access to the file on disk (see the sketch below). – robermann Mar 18 '14 at 14:56
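
A rough sketch of that offset-index idea (a hypothetical `LineIndex` helper, untested; `RandomAccessFile.readLine()` decodes bytes roughly as Latin-1, so a real version would need proper charset handling):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    private final List<Long> lineStarts = new ArrayList<>();
    private final RandomAccessFile raf;

    public LineIndex(String path) throws IOException {
        // one buffered pass over the file, recording where every line starts
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;
            lineStarts.add(0L);              // first line starts at byte 0
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    lineStarts.add(pos);     // next line starts right after the '\n'
                }
            }
        }
        raf = new RandomAccessFile(path, "r");
    }

    // Returns up to 'count' lines starting at line index 'first' (0-based).
    public List<String> window(int first, int count) throws IOException {
        List<String> lines = new ArrayList<>();
        raf.seek(lineStarts.get(first));
        for (int i = 0; i < count && first + i < lineStarts.size(); i++) {
            String line = raf.readLine();
            if (line == null) break;         // reached end of file
            lines.add(line);
        }
        return lines;
    }

    public void close() throws IOException {
        raf.close();
    }
}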

4 Answers

result += line + "\n";

This line tries to keep the entire file contents in memory. Try to process each line as you read it instead, like this:

while ((line = br.readLine()) != null) {
    processLine(line); // this may write it to another file.
}
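
For example, a streaming pass that never holds more than one line in memory might look like this (just a sketch; the method name, paths, and the processing step are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public static void process(String inPath, String outPath) throws IOException {
    try (BufferedReader br = new BufferedReader(
             new InputStreamReader(new FileInputStream(inPath), "UTF-8"));
         BufferedWriter bw = new BufferedWriter(
             new OutputStreamWriter(new FileOutputStream(outPath), "UTF-8"))) {
        String line;
        while ((line = br.readLine()) != null) {
            // whatever processLine(line) needs to do goes here; this sketch just copies
            bw.write(line);
            bw.newLine();
        }
    }
}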
sheu

At the very least, I'd recommend changing

result += line + "\n";

to a StringBuilder.

resultBldr.append(line).append("\n");

This avoids creating a new string object--a bigger and bigger and bigger and bigger string object!--on each line.
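
For instance, the question's read() with just that change might look like this (a sketch only; it still holds the whole file in memory, but produces far less garbage):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public static String read(String dir) {
    StringBuilder resultBldr = new StringBuilder();
    try (BufferedReader br = new BufferedReader(
             new InputStreamReader(new FileInputStream(dir), "UTF-8"))) {
        String line;
        while ((line = br.readLine()) != null) {
            // append() grows one internal buffer instead of allocating a new String per line
            resultBldr.append(line).append('\n');
        }
    } catch (IOException ex) {
        // do something
    }
    return resultBldr.toString();
}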

Also, you should definitely write your output to the file line by line. Don't accumulate all that text and then output it.

In other words, in this situation, complete separation between your read and write functions is not recommended.

aliteralmind

Remember that every concatenation of strings creates a new string, so if you read every character of a big 40 MB file and concatenate them, you end up creating something like 40,000,000 strings in read().

Try to use StringBuffer instead of String; it is recommended for situations like this.
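
For example, the per-character loop in the question's MappedByteBuffer read() could be rewritten along these lines (just a sketch; StringBuilder is the unsynchronized variant and is usually preferable when only one thread is involved):

// drop-in replacement for the loop inside the question's second read()
StringBuilder s = new StringBuilder(buffer.limit());   // pre-size once, avoid regrowing
for (int i = 0; i < buffer.limit(); i++) {
    s.append((char) buffer.get());   // still wrong for multi-byte UTF-8, but no new String per character
}
return s.toString();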

Narkha

It's always a bad idea to read large files in the 1 GB - 5 GB range in a single shot. There will be a huge performance overhead and your app will slow down.

It's better to split this huge file into smaller chunks and read it chunk by chunk. I think that if you start reading the file in smaller chunks, the code you have written will work fine.
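
A rough sketch of reading in fixed-size character chunks (the method name, chunk size, and processing hook are arbitrary choices here):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public static void readInChunks(String dir) throws IOException {
    char[] chunk = new char[64 * 1024];   // 64K characters per read - an arbitrary chunk size
    try (BufferedReader br = new BufferedReader(
             new InputStreamReader(new FileInputStream(dir), "UTF-8"))) {
        int n;
        while ((n = br.read(chunk, 0, chunk.length)) != -1) {
            // work on chunk[0..n) here (searching, writing results to a log, ...);
            // nothing larger than one chunk is ever held in memory
        }
    }
}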

Have you heard about HDFS, Solr indexing, or the Apache Hadoop framework, which are specifically designed for manipulating huge amounts of data? You might want to have a look into them.

vikeng21