-1

I'm having a problem working with a 1.3 GB CSV file (it contains 3 million rows). I want to sort the file by a field called "Timestamp", and I can't split the file into multiple reads, because then the sorting won't work properly. At one point I get the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

This is my code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import com.opencsv.CSVWriter;

public class createCSV {
    public static BufferedReader br = null;
    public static String csvFile = "/Scrivania/dataset";
    public static String newcsvFile = "/Scrivania/ordinatedataset";
    public static String extFile = ".csv";
    
    public static void main(String[] args) {
        try {
            List<List<String>> csvLines = new ArrayList<>();
            br = new BufferedReader(new FileReader(csvFile+extFile));
            
            CSVWriter writer = new CSVWriter(new FileWriter(newcsvFile+extFile));
            
            String line = br.readLine();
            String[] fields = line.split(",");
            writer.writeNext(fields);
            line = br.readLine();
            while(line!=null) {
                csvLines.add(Arrays.asList(line.split(",")));           
                line = br.readLine();
            }
            
            csvLines.sort(new Comparator<List<String>>() {
                @Override
                public int compare(List<String> o1, List<String> o2) {
                    return o1.get(8).compareTo(o2.get(8));
                }
            });
            for(List<String>lin:csvLines){
                writer.writeNext(lin.toArray(new String[0]));
            }
            writer.close();
        }catch(IOException e) {
            e.printStackTrace();
        }
         
    }

}

I have tried increasing the heap size to the maximum, 2048 MB, specifically -Xms512M -Xmx2048M in Run -> Run Configurations, but it still gives me the error. How can I solve this and sort the whole file? Thank you in advance

sirss
  • 171
  • 1
  • 9
  • As you are using OpenCSV to write your file, you should also use it to read it, instead of using String.split which may get things wrong. – k314159 Mar 29 '21 at 17:23
  • How is 2GB the "maximum"? You can surely allocate a lot more if you have free memory, and at some point your program will work. If you need to squeeze under 2 GB because this is some kind of a challenge / homework, then enjoy :). That's the spirit of the question, to find how to merge without holding everything in memory. Hint: External merge sort. – Petr Janeček Mar 29 '21 at 17:29
  • 1
    There are command line tools for that : https://stackoverflow.com/questions/41423483/how-can-i-sort-a-very-large-csv-file – Turo Mar 29 '21 at 17:30
  • This may help you https://stackoverflow.com/questions/2356137/read-large-files-in-java – Mohamed Ismail M Mar 29 '21 at 17:48

2 Answers

0

Reading the whole file into a list keeps all of its data in memory, and that is what exhausts the heap. What you need is to stream through the file, one line at a time. You can do this with the java.util.Scanner class, or with Apache Commons IO's LineIterator.

With Scanner:

List<List<String>> csvLines = new ArrayList<>();
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        csvLines.add(Arrays.asList(line.split(",")));   
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}

With Apache Commons IO:

LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}
hiren
  • 1,742
  • 13
  • 20
  • I don't understand how this solution avoids the need to have the whole result in memory. Isn't this code building up the entire result in memory in `csvLines`? In fact, you don't seem to ever write that structure to a file. Also, this is about sorting the file. Where are you doing any sorting? – CryptoFool Mar 29 '21 at 17:55
  • I have not shown the writing logic. You can write it out from the ArrayList by sorting it with a comparator, as shown in the question, and using a FileOutputStream. – hiren Mar 29 '21 at 17:58
  • Writing it is not the point. The point is that your approach seems to involve reading the entire file into memory, which is what the OP is asking how to avoid. – CryptoFool Mar 29 '21 at 20:52
  • I understand your point, but reading the file with an InputStream does not put the whole file in memory. I have not tested this myself, but you may find this link useful: https://www.baeldung.com/java-read-lines-large-file – hiren Mar 30 '21 at 04:40
0

Hopefully you can find an existing library that will do this for you, or call a command line tool from Java to do it instead. If you need to code this yourself, here's a suggestion for a pretty simple approach you might code up...

There's a simple general approach to sorting a large file like this. I call it a "shard sort". Here's what you do:

Pick a number N that is the number of shards you'll have, and a function that produces a value between 0 and N-1 for each input entry, such that you get roughly the same number of entries in each shard. For example, you could choose N to be 10, use the seconds part of your timestamp, and make the shard id id = seconds % 10. This should "randomly" spread your entries across the 10 shards.

Now open the input file and 10 output files, one for each shard. Read each entry from the input file, compute its shard id, and write it to the file for that shard id.

Now read each shard file into memory, sort it based on each entry's timestamp, and write it back out to its file. For this example, this takes roughly 10% of the memory needed to sort the whole file at once.

Now open the 10 shard files for reading and a new result file to contain the final result. Read the next entry from all 10 input files. Write the earliest of those 10 entries, timestamp-wise, to the output file. Each time you write out a value, read a new one from the shard file it came from. Repeat this process until all the shard files are empty and all entries in memory have been written.

If your file is so big that 10 shards isn't enough, use more. You could, for example, use 60 shard files and use the entire seconds value from your timestamp for the shard id.
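
If it helps, here is a rough, untested sketch of that idea using only the standard library. Everything specific in it is an assumption to adjust to your data: the file names, the shard count, the timestamp being column index 8 (taken from the comparator in the question), the timestamp sorting correctly as a plain string (e.g. ISO-8601), and no field containing an embedded comma or newline; for quoted CSV, swap the split(",") calls for OpenCSV's CSVReader/CSVWriter.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ShardSort {

    static final int TIMESTAMP_COLUMN = 8; // column to sort on (assumption)
    static final int SHARD_COUNT = 10;

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("dataset.csv");          // hypothetical paths
        Path output = Paths.get("dataset-sorted.csv");

        // 1) Split the big file into SHARD_COUNT smaller files. Any split works
        //    as long as each shard fits in memory; round-robin is the simplest.
        List<Path> shards = new ArrayList<>();
        List<BufferedWriter> shardWriters = new ArrayList<>();
        for (int i = 0; i < SHARD_COUNT; i++) {
            Path shard = Files.createTempFile("shard-" + i + "-", ".csv");
            shards.add(shard);
            shardWriters.add(Files.newBufferedWriter(shard));
        }
        String header;
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            header = reader.readLine();                 // keep the header aside
            String line;
            long row = 0;
            while ((line = reader.readLine()) != null) {
                BufferedWriter w = shardWriters.get((int) (row++ % SHARD_COUNT));
                w.write(line);
                w.newLine();
            }
        }
        for (BufferedWriter w : shardWriters) {
            w.close();
        }

        // 2) Sort each shard on its own; only one shard is in memory at a time.
        Comparator<String> byTimestamp =
                Comparator.comparing(line -> line.split(",")[TIMESTAMP_COLUMN]);
        for (Path shard : shards) {
            List<String> rows = Files.readAllLines(shard);
            rows.sort(byTimestamp);
            Files.write(shard, rows);
        }

        // 3) K-way merge: hold one "current" row per shard in a priority queue
        //    and always write out the one with the earliest timestamp.
        List<BufferedReader> shardReaders = new ArrayList<>();
        for (Path shard : shards) {
            shardReaders.add(Files.newBufferedReader(shard));
        }
        // Each queue entry is { csvLine, shardIndex }, ordered by timestamp.
        PriorityQueue<String[]> heap = new PriorityQueue<>(
                Comparator.comparing((String[] e) -> e[0].split(",")[TIMESTAMP_COLUMN]));
        for (int i = 0; i < shardReaders.size(); i++) {
            String first = shardReaders.get(i).readLine();
            if (first != null) {
                heap.add(new String[] { first, Integer.toString(i) });
            }
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            if (header != null) {
                out.write(header);
                out.newLine();
            }
            while (!heap.isEmpty()) {
                String[] smallest = heap.poll();
                out.write(smallest[0]);
                out.newLine();
                // refill from the shard the winning row came from
                String next = shardReaders.get(Integer.parseInt(smallest[1])).readLine();
                if (next != null) {
                    heap.add(new String[] { next, smallest[1] });
                }
            }
        }
        for (BufferedReader r : shardReaders) {
            r.close();
        }
        for (Path shard : shards) {
            Files.delete(shard);
        }
    }
}

The per-shard sort in step 2 is where the memory bound comes from: with 10 shards, only about a tenth of the rows are in memory at any one time, while the split and merge phases hold just one line per file.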

CryptoFool
  • 21,719
  • 5
  • 26
  • 44