0

There're 200 million floats and maybe some are duplicates.

What's an efficient way (for example, with less than 1G memory) to get the rank for every element in them (they're unsorted at first)?

Like this:

Input: [3.2, 3.2, 3.4, 7.81, 1.0]

Output: [2, 2, 4, 5 ,1]

I think of the bitmap sort, but it looks not memory-efficient in this situation.

Mike Samuel
  • 118,113
  • 30
  • 216
  • 245
Hanfei Sun
  • 45,281
  • 39
  • 129
  • 237
  • What exactly is the "rank" of an element? – orlp Aug 22 '12 at 21:05
  • 5
    You are unlikely to be able to do it with much less than 1.6G. Storing the floats themselves will take 200 million * 4 bytes = 800 MB, and then storing the output is another (at least) 800 MB. – huon Aug 22 '12 at 21:06
  • @dbaupp I assume that the values are stored on disk initially and the ranks will also be saved on disk, so it isn't necessarily true that the entire data set will ever need to be stored in memory. – bnaul Aug 22 '12 at 21:07
  • I think he means 'rank' is the index of the number in the sorted list, but with 2 equal values taking the rank of the first one in the output (3.2 both have rank 2 in his example. 1.0 has rank 1 because it's the smallest value.) – David Yaw Aug 22 '12 at 21:07
  • bitmap doesn't work: as 1) you have no idea how many times an element can occour (well, max 200 million times, but that doesn't help much) 2) floats occupy a continuous range. – Karoly Horvath Aug 22 '12 at 21:09
  • You might be able to first partition by the signs, and then the exponents; that can be done nearly in-place with 1KB of extra memory tops (an `int[256]`, to be precise) – Louis Wasserman Aug 22 '12 at 21:54
  • why the result is not `[2, 2, 3, 4, 1]`? – jfs Aug 22 '12 at 22:01

5 Answers5

1

I don't think you'll be able to do it all in 1G. Note that your 200 Mvalue dataset would take ~763 MiB, leaving only ~261 MiB available for auxiliary data. This rules out any approach that requires you to store the indices at the same time as the values, since an index into 200 Mvalues would take at least 28 bits. Practically, you'd really want 32 bits, which would take the same space as the original (presumably 32-bit) floating point values.

One approach to consider would be to perform a sort on the original data while logging the decision information to a bitmap, then replacing the original data with rank indices and reversing the permutation using the log.

However, the resulting permutation would require at least log2(N!) > N log2(N) - N log2(e) bits of storage in the worst case (so there's no way to get around it by using a radix sort or something). For the specified problem, note that log2(200M)>27 so the log could require more than (200M * 25.5) / (8bits/byte) ~ 608 MiB -- almost as large as the original dataset, and far larger than the specified auxiliary space.

You could write the decision log to disk, and reread it to generate your answer. But if you're allowing disk I/O, you might as well do an external sort instead, which would allow you to solve problems much larger than your memory could hold.

comingstorm
  • 25,557
  • 3
  • 43
  • 67
0

You don't want to sort the array, but you want to get an array of indices where the positions would be after sorting. It'd take more than your 1 GB of memory, and you'd probably have to do some post-processing to make equal elements have the same rank, but you should be able to use this solution as a beginning point: Get the indices of an array after sorting?

Community
  • 1
  • 1
David Yaw
  • 27,383
  • 4
  • 60
  • 93
0

You can sort ranges of floats based on their int values e.g. Float.floatToRawInt(float).

If you have 1 GB and you store 8 bytes per value you can sort groups of up to 128 million or 2^27 values. This means you will be able to rank them all with 2^5 or 32 passes.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • The order is not the same: floats: `-2.,-1.,0.,1.,2.` -> raw ints: `-1073741824, -1082130432, 0, 1065353216, 1073741824` – jfs Aug 22 '12 at 21:32
  • @J.F.Sebastian I left it to the OP's imagination on how to sort the groups. If you group by lower 27 bits at a time, they will all be negative or positive. – Peter Lawrey Aug 23 '12 at 07:16
0

You could try to do External sorting as described on wikipedia.

Try to use a memory mapped file when dealing with the float data.

public static void main(String[] args) throws IOException {
    RandomAccessFile raf = new RandomAccessFile("floats.dat", "rw");
    FileChannel fc = raf.getChannel();
    MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_WRITE, 0, 1024 * 1024 * 1024);
    FloatBuffer fb = mbb.asFloatBuffer();
    Random random = new Random();
    for (int i = 0; i < 200000000; i++) {
        float rand = random.nextFloat();
        fb.put(rand);
    }
    fb.flip();

    // Read data in chunks, tune the size
    float[] f = new float[100000];
    fb.get(f, 0, f.length);

    // Process the data using some merge strategy
}

As I understand the float array itself should not be sorted. Store the int array using memory mapped file as well.

maba
  • 47,113
  • 10
  • 108
  • 118
0

If you are using the standard Java sorting method and an array of floats you are fine with ~1.2GB IMO as it already uses a very space efficient and fast (n lg(n)) sorting method (TimSort, MergeSort) - see Arrays.sort.

To make it even faster you could convert the floats to integers (but you'll need to know precision up front) and then apply integer sort or the already mentioned radix sort.

Karussell
  • 17,085
  • 16
  • 97
  • 197