
I am using the following code to read a large file and distribute its lines into different, shorter files. It takes 13 minutes for a 100 MB file.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;


public class DivideData {

    public static void main(String[] args) throws IOException {
        Scanner data = new Scanner(new File("D:\\P&G\\March Sample Data\\march.txt"));

        long startTime = System.currentTimeMillis();
        while (data.hasNextLine()) {
            String line = data.nextLine();
            String[] split = line.split("\t");
            String filename = "D:\\P&G\\March Sample Data\\" + split[0] + " " + split[1] + ".txt";
            //System.out.println(filename);
            //System.out.println(line);

            FileWriter fw = new FileWriter(filename, true); // true opens the file in append mode
            fw.write(line); // appends the line to the file
            fw.write('\n');
            fw.close();
        }
        long stopTime = System.currentTimeMillis();
        System.out.println(stopTime - startTime);
        data.close();
        System.out.println("Data Successfully Divided!!");
    }

}

I want to know what I can do to reduce the time it takes.

  • Opening and closing files is very expensive, and making small unbuffered writes is also expensive. Cache your files and use buffered writers and you should be able to write 100 MB in a few seconds. – Peter Lawrey Nov 05 '14 at 09:04
  • I have included a number of enhancements in my answer. – Peter Lawrey Nov 05 '14 at 09:17

4 Answers

4

Move the FileWriter open and close outside the loop,

FileWriter fw = new FileWriter(filename,true); // <-- here!
while(data.hasNextLine()){                          
    String line = data.nextLine();
    String[] split = line.split("\t");
    String filename = "D:\\P&G\\March Sample Data\\" + split[0] + " "
            + split[1]+ ".txt";
    //System.out.println((filename));
    //System.out.println(line); 
    // FileWriter fw = new FileWriter(filename,true);

Otherwise it has to open the file and seek to the end for every line of input!

Edit

I noticed you don't have the filename until you're inside your loop. Let's use a Map to keep a cache of writers.

FileWriter fw = null;
Map<String, FileWriter> map = new HashMap<>();
while (data.hasNextLine()) {
    String line = data.nextLine();
    String[] split = line.split("\t");
    String filename = "D:\\P&G\\March Sample Data\\" + split[0] + " "
            + split[1] + ".txt";
    // System.out.println((filename));
    // System.out.println(line);
    if (map.containsKey(filename)) {
        fw = map.get(filename);
    } else {
        fw = new FileWriter(filename, true);
        map.put(filename, fw);
    }
    // ...
}
for (FileWriter file : map.values()) {
    file.close();
}
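
For completeness, a sketch of the loop body that goes where the "// ..." placeholder is; it is the same write as in the question, only without the per-line open and close (the writers are closed once, in the loop over map.values() at the end):

    fw.write(line);   // reuse the cached writer; the file stays open
    fw.write('\n');
    // no fw.close() here - closing happens once, after the whole input file is read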
Elliott Frisch
  • But I need that "filename" variable because I am using that as the filename. If I declare it before the while loop, it will give me an error. – nEO Nov 05 '14 at 07:47
  • @nEO I noticed that after I posted. Sorry. Edited. – Elliott Frisch Nov 05 '14 at 07:48
  • Thanks Elliott. That worked out great and is pretty fast. Will learn more about Map and hashing. – nEO Nov 05 '14 at 07:57
  • how long does it take with this method? – İsmet Alkan Nov 05 '14 at 08:00
  • @nEO This particular use is a form of [memoization](http://en.wikipedia.org/wiki/Memoization) optimization. – Elliott Frisch Nov 05 '14 at 08:05
  • @IsThatSo took me 7 seconds instead of 13 mins. The problem was with opening and closing the files each time. Moreover, to append, the pointer had to find the end of file and then append. – nEO Nov 05 '14 at 08:23
  • @nEO This business about 'find[ing] the end of the file' being a bottleneck is BS. The overhead is in the opening and closing. The end of the file is where the directory entry says it is, just like the beginning of the file. No finding required. As I predicted, this technique is two orders of magnitude faster. – user207421 Nov 05 '14 at 08:58
  • Got it. Thanks. I googled it and this was one of the things I got. – nEO Nov 05 '14 at 08:59
  • +1 As well as using BufferedWriters, you can use LinkedHashMap as an LRU cache to prevent running out of file descriptors. – Peter Lawrey Nov 05 '14 at 09:05
2

Similar to Elliott's solution, with performance enhancements noted inline.

Map<String, PrintWriter> map = new LinkedHashMap<String, PrintWriter>(128, 0.7f, true) {
    protected boolean removeEldestEntry(Map.Entry<String, PrintWriter> eldest) {
        if (size() > 200) {
            eldest.getValue().close();
            return true;
        }
        return false;
    }
};

while (data.hasNextLine()) {
    String line = data.nextLine();
    // only split the first two as that is all we need.
    String[] split = line.split("\t", 3);
    String filename = "D:\\P&G\\March Sample Data\\" + split[0] + " " + split[1] + ".txt";
    // get once, is faster than contains + get
    PrintWriter pw = map.get(filename);
    if (pw == null)
        // append mode, so a file reopened after being evicted from the cache keeps its earlier lines
        map.put(filename, pw = new PrintWriter(new BufferedWriter(new FileWriter(filename, true))));
    // writing to a BufferedWriter is faster than flushing each line, 
    // unless the lines are very long.
    pw.println(line); // use system line separator.
}
for (Writer writer : map.values())
    writer.close();

This will be more efficient and won't run out of file descriptors.

Peter Lawrey
1

Don't open and close the file each time around the loop. Open it before and close it after. You will find this orders of magnitude faster.
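
As a minimal sketch of that idea (assuming, purely for illustration, a single output file with a made-up name and an enclosing method that throws IOException; the cached-writer approach in the other answers is the generalisation to one file per key):

    // Open the reader and writer once, reuse them for every line, close them once.
    try (Scanner data = new Scanner(new File("D:\\P&G\\March Sample Data\\march.txt"));
         BufferedWriter out = new BufferedWriter(
                 new FileWriter("D:\\P&G\\March Sample Data\\all.txt", true))) {
        while (data.hasNextLine()) {
            out.write(data.nextLine());
            out.newLine();
        }
    } // try-with-resources closes both streams here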

user207421
0

Could you please just use BufferedReader & BufferedWriter to accomplish this? I think it could be faster.
It also seems that you are reopening the writer inside the loop?
Added: a larger heap size could also help performance.
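
A rough, untested sketch of that suggestion, reusing the paths and the two-column key from the question (assumes the usual java.io/java.util imports and an enclosing method that throws IOException; combine it with the writer cache from the answers above):

    // BufferedReader for the big input file, one cached BufferedWriter per output file.
    Map<String, BufferedWriter> writers = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(
            new FileReader("D:\\P&G\\March Sample Data\\march.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split("\t", 3);
            String filename = "D:\\P&G\\March Sample Data\\" + split[0] + " " + split[1] + ".txt";
            BufferedWriter w = writers.get(filename);
            if (w == null) {                       // first time this key appears: open once, in append mode
                w = new BufferedWriter(new FileWriter(filename, true));
                writers.put(filename, w);
            }
            w.write(line);
            w.newLine();                           // platform line separator
        }
    } finally {
        for (BufferedWriter w : writers.values())
            w.close();                             // flush and close each output file once
    }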

sanigo