
How do I write a Java class to sort 10 billion integers, assuming only a subset of them can fit in memory at once?

I have done the sorting part, but the question is how I would handle the 10 billion values in the first place.

How am I going to sort them if I can only load a portion of them into memory at a time?

If you can help me with source code, it would be much appreciated.

Thanks in advance.

Here is my latest code; you can run it and guide me from there.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Scanner;

/**
 * @project jqcontacts
 * @date Mar 26, 2012
 * @time 11:16:35 AM
 */
public class SortIntegers {

    private static String BIG_FILE="g:/bigFile.txt";
    private static String SORT_FILE_PREFIX = "g:/sortFile";
    /*private static final int SORT_FILE_MAX_SIZE = 1000000;*/
    private static final int SORT_FILE_MAX_SIZE = 10;
    private static final String MAIN_FILE =  "g:/rawfile1.txt";
    private static int RAWA_FILE_MAX_SIZE = 100;
    // if MERGE_BUFFER_INITIAL_SIZE is kept equal to SORT_FILE_MAX_SIZE, the final big file comes out sorted.
    private static int MERGE_BUFFER_INITIAL_SIZE=5;
    private static int MERGE_BUFFER_SIZE_NEXT = MERGE_BUFFER_INITIAL_SIZE;
    private static int MERGE_BUFFER_SIZE_PREVIOUS = 0;

    private static int countFile = 0;

    /**
     * Splits the raw input file into sorted chunk files of
     * SORT_FILE_MAX_SIZE integers each (sortFile0.txt, sortFile1.txt, ...).
     */
    public static void readFile(String name) throws FileNotFoundException {

        Scanner scanner = new Scanner(new File(name));
        List<Integer> intList = new ArrayList<Integer>();
        int fileSize = 0;

        while (scanner.hasNextInt()) {
            intList.add(scanner.nextInt());
            ++fileSize;
            if (fileSize >= SORT_FILE_MAX_SIZE) {
                Collections.sort(intList);
                String fileName = SORT_FILE_PREFIX + countFile + ".txt";

                PrintWriter out = openWriter(fileName);
                for (int i : intList) {
                    writeFile(i, out);
                }
                out.close();

                intList.clear();
                ++countFile;
                fileSize = 0;
            }
        }
        scanner.close();

        // Note: any leftover values (fewer than SORT_FILE_MAX_SIZE) are dropped here;
        // this only works because RAWA_FILE_MAX_SIZE is a multiple of SORT_FILE_MAX_SIZE.

        System.out.println("done!");
    }


    /**
     * Reads the values whose positions fall in the current window
     * (MERGE_BUFFER_SIZE_PREVIOUS, MERGE_BUFFER_SIZE_NEXT] from a chunk file,
     * appends them to the given list and keeps it sorted.
     */
    public static List<Integer> readSortFile(String name, List<Integer> list) throws FileNotFoundException {

        Scanner scanner = new Scanner(new File(name));

        int position = 0;
        while (scanner.hasNextInt()) {
            int value = scanner.nextInt(); // always consume the value, even when it is outside the window
            ++position;
            if (position > MERGE_BUFFER_SIZE_PREVIOUS && position <= MERGE_BUFFER_SIZE_NEXT) {
                list.add(value);
            }
            if (position >= MERGE_BUFFER_SIZE_NEXT) {
                break;
            }
        }
        scanner.close();

        Collections.sort(list);
        return list;
    }

    private static PrintWriter openWriter(String name) {
        try {
            File file = new File(name);
            PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(file)), true);
            return out;
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(0);
        }
        return null;
    }

    private static void writeFile(int i, PrintWriter out) {
        out.println(Integer.toString(i) + "\t");
    }

    /**
     * @param args
     */
    public static void main(String[] args) {

        generateRawIntFile();

        try {
            readFile(MAIN_FILE);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        System.out.println("countFile: " + countFile);

        // merge step: combine the sorted chunk files into one big file

        List<Integer> comboList = new ArrayList<Integer>();
        boolean moreToMerge = true;
        PrintWriter outP = openWriter(BIG_FILE);

        while (moreToMerge) {

            for (int i = 0; i < countFile; i++) {
                try {
                    // TODO: do we need the return type for readSortFile?
                    comboList = readSortFile(SORT_FILE_PREFIX + i + ".txt", comboList);
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                }
            }

            // write the merged window into the big file and clear it for the next window
            try {
                for (int value : comboList) {
                    writeFile(value, outP);
                }
                comboList.clear();

                MERGE_BUFFER_SIZE_PREVIOUS = MERGE_BUFFER_SIZE_NEXT;
                MERGE_BUFFER_SIZE_NEXT += MERGE_BUFFER_INITIAL_SIZE;

                System.out.println("MERGE_BUFFER_SIZE_PREVIOUS: " + MERGE_BUFFER_SIZE_PREVIOUS
                        + " MERGE_BUFFER_SIZE_NEXT: " + MERGE_BUFFER_SIZE_NEXT);

                if (MERGE_BUFFER_SIZE_PREVIOUS >= RAWA_FILE_MAX_SIZE) {
                    System.out.println("sorting is finished");
                    moreToMerge = false;
                    break;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        outP.close();
    }

    /**
     * Generates RAWA_FILE_MAX_SIZE random integers and writes them to the raw input file.
     */
    public static void generateRawIntFile() {

        Random randomGenerator = new Random();

        PrintWriter out = openWriter(MAIN_FILE);
        for (int i = 0; i < RAWA_FILE_MAX_SIZE; i++) {
            int value = randomGenerator.nextInt(RAWA_FILE_MAX_SIZE);
            writeFile(value, out);
        }
        out.close();
    }

}
user1430003
  • Common interview question :) [One answer out of many out there](http://www.umbrant.com/blog/2011/external_sorting.html). – Ray Toal Mar 26 '12 at 05:51
  • Try starting with sorting within your subset (trivial), then sort your subsets amongst your other subsets (not so trivial) – hkf Mar 26 '12 at 05:52
  • http://en.wikipedia.org/wiki/External_sorting – TofuBeer Mar 26 '12 at 05:52
  • See StackOverflow answers [here](http://stackoverflow.com/a/3740620/960195) and [here](http://stackoverflow.com/q/8402106/960195). – Adam Mihalcin Mar 26 '12 at 05:56

1 Answer


There are only 4 billion possible int values, so the most efficient way to do this is to count the number of occurrences of each value. You can use a memory-mapped file (MappedByteBuffer) for the counters so you don't need 16 GB of heap. Once all the occurrences have been counted, the counts are naturally in value order, so no further sorting is required. The time complexity is O(n) instead of the O(n log n) of merge sort or quicksort.
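To make the counting idea concrete, here is a minimal in-memory sketch for a small, assumed value range (the class name and range are made up for illustration); the memory-mapped code below applies the same idea to the full 32-bit range:

import java.util.Random;

public class CountingSortSketch {
    public static void main(String[] args) {
        int range = 1000;                 // assumed small value range, for illustration only
        long[] counts = new long[range];  // one counter per possible value

        Random random = new Random();
        for (int i = 0; i < 1000000; i++)
            counts[random.nextInt(range)]++;      // counting pass: O(n)

        // Walking the counters in index order yields the values sorted,
        // with no comparison-based sort needed.
        for (int value = 0; value < range; value++)
            if (counts[value] != 0)
                System.out.println(value + ": " + counts[value]);
    }
}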


import sun.nio.ch.DirectBuffer;

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class Sort10Billion {
    public static void main(String... args) throws IOException {
        Runtime runtime = Runtime.getRuntime();
        long used1 = runtime.totalMemory() - runtime.freeMemory();

        MassiveCounterStore mcs = new MassiveCounterStore();
        long start = System.nanoTime();
        long count = 10 * 1000 * 1000 * 1000L;
        for (long i = count; i > 0; i--)
            mcs.incrementIndex((int)  (i / 1019));
        mcs.iterator(new NumberCountFunction() {
            @Override
            public void counted(int n, long count) {
//                System.out.println(n + ": " + count);
            }
        });
        long time = System.nanoTime() - start;
        long used2 = runtime.totalMemory() - runtime.freeMemory();
        System.out.printf("Took %.1f seconds to sort %,d numbers, using %.3f MB%n", time / 1e9, count, (used2-used1)/1e6);
        mcs.close();
    }
}

interface NumberCountFunction {
    public void counted(int n, long count);
}

class MassiveCounterStore {
    public static final int PARTITION_BITS = 26;
    static final int PARTITIONS = (1 << (34 - PARTITION_BITS));  // 32-bit * 4 bytes.
    final MappedByteBuffer[] buffers = new MappedByteBuffer[PARTITIONS];
    final FileChannel channel;
    int smallest = PARTITIONS;
    int largest = 0;

    public MassiveCounterStore() throws IOException {
        File tmpStore = File.createTempFile("counter", "dat");
        tmpStore.deleteOnExit();

        channel = new RandomAccessFile(tmpStore, "rw").getChannel();
        for (int i = 0; i < PARTITIONS; i++)
            buffers[i] = channel.map(FileChannel.MapMode.READ_WRITE, (long) i << PARTITION_BITS, 1 << PARTITION_BITS);
    }

    public void incrementIndex(int n) {
        long l = (n + Integer.MIN_VALUE) & 0xFFFFFFFFL;             // shift the signed int onto [0, 2^32)
        int partition = (int) (l >> (PARTITION_BITS - 2));          // which mapping holds this value's counter (4 bytes per counter)
        int index = (int) ((l << 2) & ((1 << PARTITION_BITS) - 1)); // byte offset of the counter within that mapping
        MappedByteBuffer buffer = buffers[partition];
        int count = buffer.getInt(index);
        buffer.putInt(index, count + 1);
        if (smallest > partition) smallest = partition;
        if (largest < partition) largest = partition;
    }

    public void iterator(NumberCountFunction nfc) {
        int n = (smallest << (PARTITION_BITS -2)) + Integer.MIN_VALUE;
        for (int p = smallest; p <= largest; p++) {
            MappedByteBuffer buffer = buffers[p];
            for (int i = 0; i < 1 << PARTITION_BITS; i += 4) {
                int count = buffer.getInt(i);
                if (count != 0)
                    nfc.counted(n, count & 0xFFFFFFFFL);
                n++;
            }
        }
        assert n == Integer.MIN_VALUE;
    }

    public void close() {
        try {
            channel.close();
        } catch (IOException ignored) {
        }
        for (MappedByteBuffer buffer : buffers) {
            ((DirectBuffer) buffer).cleaner().clean();
        }
    }
}

This prints, when run with `-XX:-UseTLAB` (which gives more accurate memory-usage figures):

Took 150.7 seconds to sort 10,000,000,000 numbers, using 0.202 MB

I think using 202 KB is pretty good. ;)

Note: your performance is heavily dependent on the distribution of values, as this impacts the efficiency of the cache.

Peter Lawrey
  • I have sorted the millions of integer records into 10 MB sorted files. Now I need to merge them into one big sorted file. The problem is that I am using a Scanner to read the files, so I can't read the files in parallel or index into them the way I would with an array. Because of this I am stuck; any idea how I can do this? – user1430003 Mar 26 '12 at 11:58
  • If you don't want to use Scanner, use a binary format. You can read the files in different threads, but if you use a binary format the CPU cost will be much lower and you won't have a problem. Not sure what the rest of your problems are... Have you considered doing what I suggested? Are your numbers 32-bit or 64-bit etc? – Peter Lawrey Mar 26 '12 at 12:02
  • It's 32-bit, and I am not getting any CPU issues. I was thinking of reading, say, 1000 integers from every file, adding them to an array, sorting it again with Collections.sort and then writing it to the file. I am so close now: I use next and previous counters to keep track of the traversal position in each file, since I have kept the files a uniform size. So it's going to be done; I am working on this approach now, let's see. Any suggestion or help in this regard would be highly appreciated. – user1430003 Mar 26 '12 at 12:52
  • It's hard to picture, because your solution appears to be rather complicated; can you add some sample code to your question? – Peter Lawrey Mar 26 '12 at 13:05
  • Nice one. But for guaranteed correctness, you need more than 32 bits for the counters. Or I'll give you 2^32 1s, 2^32 2s and the rest 3s on one day, 2^32 4s, 2^32 5s and the rest 3s the next. – Daniel Fischer Mar 26 '12 at 15:07
  • True, it will overflow. It could use 64-bit counters instead, but requires double the space. – Peter Lawrey Mar 26 '12 at 15:55
  • Thank you, Peter, for the code, but I am really going in a different direction; I am in the middle of writing small sorted files and then combining them into the big file. Thanks everyone for your comments and help, I really appreciate it. – user1430003 Mar 27 '12 at 05:36
  • I guess I didn't see what your problem is, once you have a few hundred files you can merge sort them all to produce your result file. If you have thousands, you need to merge them into hundreds of files first. – Peter Lawrey Mar 27 '12 at 07:06
  • I am stuck now. I was able to sort them into multiple files, but when I merge them into one, the problem is how to do it. In my case each iteration puts a block of sorted values into the final file (which I generate after the first step discussed earlier), and then the next iteration does the same, so if we look at the big file now, it contains multiple sorted runs one after another and as a whole it is not sorted. – user1430003 Mar 27 '12 at 10:41
  • You need to read the first number of each file. Find the smallest and write this value, replacing it with the next value from that file. This way you are always writing the next smallest value from any file (see the sketch after this comment thread). – Peter Lawrey Mar 27 '12 at 10:49
  • We can't replace a value in the file and can't use this approach: as the file size grows, say to 100 MB, it wouldn't be possible to read it, compare it with a new value and adjust the values in that file. I think I would need to create more files to further sort the sorted files, and so on... but here I am lost. You can run my code to see what I mean; it also generates the files. With this approach I can easily sort something like 100000000 values with a buffer of 1000000. – user1430003 Mar 27 '12 at 11:07
  • You replace the value you last read in memory. You only need to read one value at a time from each file regardless of the size of the file. BTW: If you are concerned about memory consumption a `List` is a poor choice as it uses about 5x as much memory as a `int[]` – Peter Lawrey Mar 27 '12 at 11:25
  • I am not keeping much in memory; I read values from the files, say 10 at a time from 10 different files, merge and sort them, and put them into the big file. Now what do you say? – user1430003 Mar 27 '12 at 11:44
  • The problem with doing this is that you can only safely write out 10 values at a time, not 100, e.g. if all the smallest values happen to be in one file already. Reading one value at a time is much simpler. – Peter Lawrey Mar 27 '12 at 11:57
  • You mean to say that I first get the smallest value in the whole file and then traverse the whole file again; if the next value read is greater than or equal to the previous value, I append it, and so on. When my buffer reaches, say, 100000 records, I write that buffer to the file and continue this process until I find the largest value equal to my last value; if there is no bigger value, I stop. But what if I have an end value of, say, 10000000 and there are 5 such records in the file? When I traverse the file, I will always find a next value equal to it; how do I solve that? Thanks – user1430003 Mar 27 '12 at 12:04
  • You have N files with sorted data. You read the first entry, which will be the smallest (because you sorted the data), then you read the second as required, etc. You only need to read each file once. – Peter Lawrey Mar 27 '12 at 12:23
  • not if you use BigInteger – Alan Deep Jan 24 '15 at 03:36
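For reference, here is a minimal sketch of the single-pass merge Peter describes in the comments above: keep the most recently read value from each sorted chunk file in a priority queue, always write out the smallest, and refill from the file it came from, so each file is read exactly once. The chunk-file prefix and big-file path come from the question's code; the class and helper names are made up for illustration, and a binary format (DataInputStream/DataOutputStream) could replace the text parsing to cut CPU cost, as suggested above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.PriorityQueue;

public class MergeSortedChunks {

    // One open chunk file plus the last value read from it.
    static class ChunkCursor implements Comparable<ChunkCursor> {
        final BufferedReader reader;
        int current;

        ChunkCursor(BufferedReader reader) {
            this.reader = reader;
        }

        // Reads the next value from this chunk file; returns false when the file is exhausted.
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null || line.trim().isEmpty()) return false;
            current = Integer.parseInt(line.trim());
            return true;
        }

        public int compareTo(ChunkCursor other) {
            return current < other.current ? -1 : current > other.current ? 1 : 0;
        }
    }

    public static void main(String[] args) throws IOException {
        int chunkFiles = 10;            // countFile from the question's run (100 values / 10 per chunk)
        String prefix = "g:/sortFile";  // SORT_FILE_PREFIX from the question's code

        // Prime the queue with the first (smallest) value of each sorted chunk file.
        PriorityQueue<ChunkCursor> queue = new PriorityQueue<ChunkCursor>();
        for (int i = 0; i < chunkFiles; i++) {
            ChunkCursor cursor = new ChunkCursor(new BufferedReader(new FileReader(prefix + i + ".txt")));
            if (cursor.advance()) queue.add(cursor);
            else cursor.reader.close();
        }

        PrintWriter out = new PrintWriter("g:/bigFile.txt");
        while (!queue.isEmpty()) {
            ChunkCursor smallest = queue.poll();         // file holding the globally smallest unwritten value
            out.println(smallest.current);
            if (smallest.advance()) queue.add(smallest); // replace it with the next value from the same file
            else smallest.reader.close();
        }
        out.close();
    }
}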