8

I'm working with a very large text file (755 MB). I need to sort its lines (about 1,890,000 of them) and then write them back to another file.

I already noticed this discussion, which starts from a file really similar to mine: Sorting Lines Based on words in them as keys

The problem is that I cannot store the lines in a collection in memory, because I get a Java heap space exception even after expanding the heap to its maximum (already tried!).

I can't open it with Excel and use its sorting feature either, because the file is too large to be loaded completely.

I thought about using a DB, but I suspect that writing all the lines and then fetching them with a SELECT query would take too long. Am I wrong?

Any hints appreciated. Thanks in advance.

Lucia Belardinelli
  • Well, "too long" depends on your expectations. If you hope to do it in half a second, it'll indeed be too long. If you don't mind waiting for a few seconds or minutes, it shouldn't be a problem. Try it, and see if the time is reasonable. – JB Nizet Jan 12 '12 at 09:45
  • You should be able to store the file in memory with about 1 GB of heap using the latest versions of Java. i.e. with `-XX:+UseCompressedStrings` – Peter Lawrey Jan 12 '12 at 09:53

6 Answers

17

I think the solution here is to do a merge sort using temporary files:

  1. Read the first n lines of the file (n being the number of lines you can afford to store and sort in memory), sort them, and write them to a file 1.tmp (or whatever you call it). Do the same with the next n lines and store them in 2.tmp. Repeat until all lines of the original file have been processed.

  2. Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.

  3. Delete all the temporary files.

This works with arbitrarily large files, as long as you have enough disk space.
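The three steps above can be sketched in Java roughly like this (the class name, file handling, and chunk size are illustrative, not taken from the answer):

```java
import java.io.*;
import java.util.*;

public class ExternalSort {
    // Step 1: read up to maxLines lines at a time, sort each batch in
    // memory, and write it to its own temporary file.
    static List<File> splitAndSort(File input, int maxLines) throws IOException {
        List<File> chunks = new ArrayList<File>();
        BufferedReader br = new BufferedReader(new FileReader(input));
        List<String> batch = new ArrayList<String>(maxLines);
        String line;
        while ((line = br.readLine()) != null) {
            batch.add(line);
            if (batch.size() == maxLines) {
                chunks.add(writeSortedChunk(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) chunks.add(writeSortedChunk(batch));
        br.close();
        return chunks;
    }

    static File writeSortedChunk(List<String> batch) throws IOException {
        Collections.sort(batch);
        File tmp = File.createTempFile("chunk", ".tmp");
        PrintWriter pw = new PrintWriter(tmp);
        for (String s : batch) pw.println(s);
        pw.close();
        return tmp;
    }

    // Step 2: read the head line of every chunk, repeatedly pick the
    // smallest one, and advance only the reader it came from.
    static void merge(List<File> chunks, File output) throws IOException {
        List<BufferedReader> readers = new ArrayList<BufferedReader>();
        List<String> heads = new ArrayList<String>();
        for (File f : chunks) {
            BufferedReader r = new BufferedReader(new FileReader(f));
            readers.add(r);
            heads.add(r.readLine()); // may be null for an empty chunk
        }
        PrintWriter out = new PrintWriter(output);
        while (true) {
            int min = -1;
            for (int i = 0; i < heads.size(); i++)
                if (heads.get(i) != null
                        && (min == -1 || heads.get(i).compareTo(heads.get(min)) < 0))
                    min = i;
            if (min == -1) break; // all chunks exhausted
            out.println(heads.get(min));
            heads.set(min, readers.get(min).readLine());
        }
        out.close();
        for (BufferedReader r : readers) r.close();
        // Step 3: delete the temporary files.
        for (File f : chunks) f.delete();
    }
}
```

The linear scan over the heads in the merge step is fine for a handful of chunks; with many chunks a priority queue would pick the minimum faster.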

celtschk
2

You can run the following with

-mx1g -XX:+UseCompressedStrings  # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings  # on Java 6 update 29
-mx2g  # on Java 7 update 2.

import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        generateFile("lines.txt", 755 * 1024 * 1024, 189000);

        List<String> lines = loadLines("lines.txt");

        System.out.println("Sorting file");
        Collections.sort(lines);
        System.out.println("... Sorted file");
        // write the sorted lines to another file
        PrintWriter writer = new PrintWriter("sorted.txt");
        for (String s : lines)
            writer.println(s);
        writer.close();
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
    }

    private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
        System.out.println("Creating file to load");
        int lineSize = size / lines;
        StringBuilder sb = new StringBuilder();
        while (sb.length() < lineSize) sb.append('-');
        String padding = sb.toString();

        PrintWriter pw = new PrintWriter(fileName);
        for (int i = 0; i < lines; i++) {
            String text = (i + padding).substring(0, lineSize);
            pw.println(text);
        }
        pw.close();
        System.out.println("... Created file to load");
    }

    private static List<String> loadLines(String fileName) throws IOException {
        System.out.println("Reading file");
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        List<String> ret = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null)
            ret.add(line);
        System.out.println("... Read file.");
        return ret;
    }
}

prints

Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
Peter Lawrey
  • Can you repeat the test using jdk7u2 to see how much memory and time it takes? – dogbane Jan 12 '12 at 10:42
  • Unfortunately Java 7 doesn't support this option http://stackoverflow.com/questions/8833385/is-support-for-compressed-strings-being-dropped – Peter Lawrey Jan 12 '12 at 10:46
  • Yep, but would still like to see how much memory it uses without the option. Maybe they have made improvements such that this option is no longer required. – dogbane Jan 12 '12 at 10:50
  • @dogbane A reasonable question, Java 7 needs 200 MB more than Java 6 with compressed Strings off. :] – Peter Lawrey Jan 12 '12 at 10:56
1

Divide and conquer is the best solution :)

Divide your file into smaller ones, sort each file separately, then merge them back together.

Links:

Sort a file with huge volume of data given memory constraint

http://hackerne.ws/item?id=1603381

Adel Boutros
1

Algorithm:

How much memory do we have available? Let’s assume we have X MB of memory available.

  1. Divide the file into K chunks, where X * K equals the total file size (here about 755 MB). Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the sorted lines back to a temporary file.

  2. Now bring the next chunk into memory and sort it.

  3. Once we’re done, merge the sorted chunks one by one.

The above algorithm is also known as an external sort. Step 3 is known as an N-way merge.
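The N-way merge in step 3 is commonly implemented with a priority queue keyed on each chunk's current head line, so picking the smallest of K heads costs O(log K) instead of O(K). A minimal sketch (the class and method names are made up for illustration; each reader is assumed to yield already-sorted lines):

```java
import java.io.*;
import java.util.*;

public class NWayMerge {
    // One entry per chunk: its current head line plus the reader it came from.
    static class Head implements Comparable<Head> {
        final String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }
        public int compareTo(Head o) { return line.compareTo(o.line); }
    }

    static void merge(List<BufferedReader> sortedChunks, Writer out) throws IOException {
        PriorityQueue<Head> pq = new PriorityQueue<Head>();
        for (BufferedReader r : sortedChunks) {
            String first = r.readLine();
            if (first != null) pq.add(new Head(first, r));
        }
        BufferedWriter bw = new BufferedWriter(out);
        while (!pq.isEmpty()) {
            Head smallest = pq.poll();      // O(log K) per output line
            bw.write(smallest.line);
            bw.write('\n');
            String next = smallest.reader.readLine();
            if (next != null) pq.add(new Head(next, smallest.reader));
        }
        bw.flush();
    }
}
```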

adarshr
e-zinc
0

Why don't you try multithreading and increasing the heap size of the program you are running? (This also requires you to use a merge-sort-like approach, unless you have more memory than 755 MB in your system.)
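If the file does fit once the heap is enlarged, the sort itself can also be spread across cores. A minimal sketch using Arrays.parallelSort, which arrived in Java 8 (after this thread was written), so treat it as an assumption about later JVMs:

```java
import java.util.Arrays;
import java.util.List;

public class ParallelSortSketch {
    // Sorts the loaded lines using all available cores (fork/join under
    // the hood). Run with e.g. -Xmx2g so the 755 MB of lines fits on the heap.
    static String[] sortInMemory(List<String> lines) {
        String[] arr = lines.toArray(new String[0]);
        Arrays.parallelSort(arr); // Java 8+; sequential for very small arrays
        return arr;
    }
}
```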

javaCity
  • See comment left for Eric.Sun above. – Jaco Van Niekerk Jan 12 '12 at 09:53
  • Yes, your reasoning is obviously useful for very, very large file sizes. But the OP specified the file size as 755 MB, and most computers today have more than 755 MB. Why use a complex algorithm if we can solve his/her problem with just -Xmx1024m? – javaCity Jan 12 '12 at 09:57
  • Merge sort is not an overly complex algorithm. I did not want to make assumptions about the hardware running the algorithm. Also, the process may not be the only software running on the device. In my humble opinion, writing 50 lines of code to save more than a GB of memory (each line may take up several bytes, if it is a string) is well worth the effort. (No offense intended.) – Jaco Van Niekerk Jan 12 '12 at 10:05
  • No, I agree with you entirely. If I were in a similar scenario, I'd have just tried increasing the heap size first. If that didn't work, I'd probably have done what you've suggested. It's all good :) – javaCity Jan 12 '12 at 10:07
  • Fair enough. Let's compromise on both solutions, i.e., try the memory approach and log a ticket to replace it with the more memory-efficient approach (with accompanying test cases) at a later stage (+1). – Jaco Van Niekerk Jan 12 '12 at 10:25
-2

Maybe you can use Perl to format the file and load it into a database like MySQL. It's very fast. Then use an index to query the data, and write the result to another file.

You can set the JVM heap size with e.g. '-Xms256m -Xmx1024m'. I hope this helps. Thanks.

Eric.Sun
  • Using a file-based merge sort is much better than simply allocating more memory. What happens if the file gets even bigger, i.e. 10 GB? – Jaco Van Niekerk Jan 12 '12 at 09:52