
I am trying to split a text file with multiple threads. The file is 1 GB, and I am reading it character by character. The execution time is 24 min 54 sec. Instead of reading the file char by char, is there a better way that will reduce the execution time? I'm having a hard time figuring out an approach. Please also suggest if there is any other better way to split a file with multiple threads. I am very new to Java.

Any help will be appreciated. :)

    public static void main(String[] args) throws Exception {
        long startTime = System.currentTimeMillis(); // used below but missing from the original snippet
        RandomAccessFile raf = new RandomAccessFile("D:\\sample\\file.txt", "r");
        long numSplits = 10;
        long sourceSize = raf.length();
        System.out.println("file length:" + sourceSize);
        long bytesPerSplit = sourceSize / numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 9 * 1024;

        List<String> filePositionList = new ArrayList<String>();
        long startPosition = 0;
        long endPosition = bytesPerSplit;
        for (int i = 0; i < numSplits; i++) {
            raf.seek(endPosition);
            String strData = raf.readLine();
            if (strData != null) {
                // extend the boundary to the end of the current line;
                // +1 accounts for the '\n' that readLine() strips
                endPosition = endPosition + strData.length() + 1;
            }
            String str = startPosition + "|" + endPosition;
            filePositionList.add(str); // add before the break so the final segment is not dropped
            if (sourceSize > endPosition) {
                startPosition = endPosition;
                endPosition = startPosition + bytesPerSplit;
            } else {
                break;
            }
        }

        raf.close(); // done computing the split boundaries

        List<MultithreadedSplit> threads = new ArrayList<>();
        for (int i = 0; i < filePositionList.size(); i++) {
            String str = filePositionList.get(i);
            String[] strArr = str.split("\\|");
            long startPositionFile = Long.parseLong(strArr[0]);
            long endPositionFile = Long.parseLong(strArr[1]);
            MultithreadedSplit objMultithreadedSplit = new MultithreadedSplit(startPositionFile, endPositionFile);
            objMultithreadedSplit.start();
            threads.add(objMultithreadedSplit);
        }

        // join all split threads first; otherwise the timing below only
        // measures how long it took to start them
        for (MultithreadedSplit t : threads) {
            t.join();
        }

        long endTime = System.currentTimeMillis();

        System.out.println("It took " + (endTime - startTime) + " milliseconds");
    }

    public class MultithreadedSplit extends Thread {

        private final long start;
        private final long end;

        public MultithreadedSplit(long startPos, long endPos) {
            start = startPos;
            end = endPos;
        }

        @Override
        public void run() {
            String threadName = Thread.currentThread().getName();
            String outFile = "out_" + threadName + ".txt";
            System.out.println("Thread reading started for start:" + start + ";end:" + end + ";threadName:" + threadName);
            try (RandomAccessFile file = new RandomAccessFile("D:\\sample\\file.txt", "r");
                 FileOutputStream out2 = new FileOutputStream("D:\\sample\\" + outFile)) {
                file.seek(start);
                int nRecordCount = 0;

                // read() returns an int so that -1 can signal EOF; casting to
                // char before the comparison (as in the original) makes the
                // EOF check never succeed
                int b = file.read();
                StringBuilder objBuilder = new StringBuilder();
                while (b != -1) {
                    char c = (char) b;
                    objBuilder.append(c);
                    if (c == '\n') {
                        nRecordCount++;
                        out2.write(objBuilder.toString().getBytes());
                        objBuilder.setLength(0);
                    }
                    // stop at this thread's segment boundary; end is an absolute
                    // file position, so compare the file pointer, not a running
                    // character count
                    if (file.getFilePointer() >= end) {
                        break;
                    }
                    b = file.read();
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }
    Don't start with multiple threads - the first thing you should do is to stop reading one char at a time. Read e.g. 1 MB in each disk access, then it will be much faster immediately. – deviantfan Sep 12 '18 at 05:26 (a sketch of this appears below the comments)
    You definitely do not need `RandomAccessFile`; try some text-based readers, see https://stackoverflow.com/questions/5868369/how-to-read-a-large-text-file-line-by-line-using-java – Scary Wombat Sep 12 '18 at 05:38
  • @deviantfan.. instead of reading by char, what should I use so that I can read 1 MB in each disk access? I am very new to Java, I don't have much idea. Will you please help me out with some code? – farhana fatima Sep 12 '18 at 05:38
  • read my comment – Scary Wombat Sep 12 '18 at 05:43
  • @ScaryWombat.. Should I try with Java 8 Streams? I don't have much idea; will you please help me with some code? It's my task :( – farhana fatima Sep 12 '18 at 06:29
  • This has nothing to do with Java 8 or streams - choose a different class to read in your data. Obviously char-by-char will be slow - do you eat rice grain by grain, or do you eat using a big mouthful? Which is quicker? Look at the answers in the link I gave you - there are some good examples. – Scary Wombat Sep 12 '18 at 06:32
  • You should be able to read a text file at a rate of 50-100 MB/s, i.e. have it take 10-20 seconds with one thread. If it is taking much longer than that, I would a) use the simplest way to read possible, and b) look at what else the program is doing. – Peter Lawrey Sep 12 '18 at 06:41
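
As a concrete illustration of what the commenters suggest, below is a minimal single-threaded sketch that splits the file using a `BufferedInputStream` with a large buffer instead of per-character reads. The 1 MB buffer size, the class name `BufferedSplit`, and the output naming are illustrative assumptions, and the chunk boundaries here are byte-based rather than line-based:

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class BufferedSplit {
        public static void main(String[] args) throws IOException {
            final int bufferSize = 1024 * 1024; // read ~1 MB per disk access, as suggested above
            final int numSplits = 10;
            File source = new File("D:\\sample\\file.txt");
            long bytesPerSplit = source.length() / numSplits;

            try (BufferedInputStream in =
                         new BufferedInputStream(new FileInputStream(source), bufferSize)) {
                byte[] buffer = new byte[bufferSize];
                for (int split = 0; split < numSplits; split++) {
                    // the last chunk also absorbs the remainder bytes
                    long remaining = (split == numSplits - 1) ? Long.MAX_VALUE : bytesPerSplit;
                    try (BufferedOutputStream out = new BufferedOutputStream(
                            new FileOutputStream("D:\\sample\\out_" + split + ".txt"))) {
                        int read;
                        while (remaining > 0
                                && (read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining))) != -1) {
                            out.write(buffer, 0, read);
                            remaining -= read;
                        }
                    }
                }
            }
        }
    }

Even single-threaded, a pass like this is disk-bound and should take on the order of seconds for 1 GB, which is why the commenters recommend fixing the read granularity before reaching for threads.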

1 Answer

The fastest way would be to map the file into memory segment by segment (mapping a large file as a whole may cause undesired side effects). This skips a few relatively expensive copy operations: the operating system loads the file into the page cache, and the JRE exposes it to your application as a view into an off-heap memory area in the form of a ByteBuffer. It usually lets you squeeze out a final 2x-3x of performance.

The memory-mapped approach requires quite a bit of helper code (see the fragment at the bottom), so it is not always the best tactical choice. If your input is line-based and you just need reasonable performance (what you have now probably is not), then simply do something like:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    ...
    Files.lines(Paths.get("/path/to/the/file"), StandardCharsets.ISO_8859_1)
    //      .parallel() // parallel processing is still possible
            .forEach(line -> { /* your code goes here */ });
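
One caveat worth adding (not in the original answer): the stream returned by `Files.lines` keeps the underlying file open, so in real code it is best wrapped in try-with-resources. A minimal sketch, with a placeholder line count standing in for real processing:

    import java.util.stream.Stream;
    ...
    try (Stream<String> lines = Files.lines(Paths.get("/path/to/the/file"), StandardCharsets.ISO_8859_1)) {
        long nonEmpty = lines.parallel()               // optional parallel processing
                             .filter(l -> !l.isEmpty())
                             .count();
        System.out.println("non-empty lines: " + nonEmpty);
    } // the underlying file handle is released here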

By contrast, a working example of code that processes the file via memory mapping looks like the fragment below. With fixed-size records (where segments can be chosen to match record boundaries exactly), subsequent segments can be processed in parallel.

    static ByteBuffer mapFileSegment(FileChannel fileChannel, long fileSize,
                                     long regionOffset, long segmentSize) throws IOException {
        long regionSize = Math.min(segmentSize, fileSize - regionOffset);

        // if the leftover tail would be tiny, fold it into this region
        final long remainingSize = fileSize - (regionOffset + regionSize);
        if (remainingSize < segmentSize / 2) {
            regionSize += remainingSize;
        }

        return fileChannel.map(FileChannel.MapMode.READ_ONLY, regionOffset, regionSize);
    }

    ...

    final ToIntFunction<ByteBuffer> consumer = ...
    final long segmentSize = ...; // e.g. some tens of megabytes
    try (FileChannel fileChannel = FileChannel.open(Paths.get("/path/to/file"), StandardOpenOption.READ)) {
        final long fileSize = fileChannel.size();

        long regionOffset = 0;
        while (regionOffset < fileSize) {
            final ByteBuffer regionBuffer = mapFileSegment(fileChannel, fileSize, regionOffset, segmentSize);
            while (regionBuffer.hasRemaining()) {
                final int usedBytes = consumer.applyAsInt(regionBuffer);
                if (usedBytes == 0)
                    break;
            }
            regionOffset += regionBuffer.position();
        }
    } catch (IOException ex) {
        throw new UncheckedIOException(ex);
    }
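
The `consumer` above is deliberately left abstract. As an illustration only (this helper is my assumption, not part of the original answer), a consumer that processes newline-delimited records and reports how many bytes it used might look like:

    // Hypothetical consumer: handles every complete '\n'-terminated record in
    // the mapped region, advances the buffer past them, and returns the number
    // of bytes consumed; returning 0 makes the outer loop re-map from the
    // current offset so a record split across segments is re-read whole.
    final ToIntFunction<ByteBuffer> consumer = buffer -> {
        int consumed = 0;
        int recordStart = buffer.position();
        for (int i = buffer.position(); i < buffer.limit(); i++) {
            if (buffer.get(i) == '\n') {
                // a complete record occupies [recordStart, i]; process it here
                consumed += i - recordStart + 1;
                recordStart = i + 1;
            }
        }
        buffer.position(recordStart); // skip past the records handled above
        return consumed;
    };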
– bobah