
I need to write a program that generates N random numbers and writes them to a binary file in descending order. It must be done without using any sorting algorithm that relies on main memory. This is what I've done so far:

#include <iostream>
#include <fstream> 
#include <ctime>
#include <cstdlib>

using namespace std;
int main () {
  srand(time(0));
  rand();
  int N;
  do {
    cout << "Enter N: ";
    cin >> N;
  } while (N <= 0);

  ofstream br("broj.dat", ios::binary | ios::trunc);

  for(int i = 0; i<N; i++){
    int a = rand();
    br.write((char *)&a, sizeof(a));
  }
  br.close();

  return 0;
}

So I've generated the random numbers and written them to the binary file, but I don't know how to sort it.

Kris Bob
  • What do you mean by main memory? – François Andrieux Sep 05 '17 at 15:12
  • I'm not supposed to put those numbers in an array and sort it with algorithms like insertion sort or bubble sort. – Kris Bob Sep 05 '17 at 15:15
  • What you are looking for is called *External Merge Sort* and there is lots of information out there on how to accomplish this. Do note that they all use a buffer of some size, otherwise you would need to make N files, which I don't think you want to do. – NathanOliver Sep 05 '17 at 15:16
  • I've read something about external merge sort but I can't seem to find any C++ implementation. – Kris Bob Sep 05 '17 at 15:20
  • Found something about external sorting here on SO https://stackoverflow.com/questions/20802396/how-external-merge-sort-algorithm-works... and the links provided by the accepted answer are a good start – Robert Sep 05 '17 at 15:29
  • Thank you. This is an assignment I have to solve, along with 5 more assignments, in 1 hour. The other ones are even more difficult. I hope there is a simple way to solve this. – Kris Bob Sep 05 '17 at 15:43
  • @KrisBob I think that you're about to learn a lesson about procrastination. Programming is not to be done at the last minute, and debugging always takes twice as long as you think. At a minimum. – btilly Sep 05 '17 at 15:55
  • @NathanOliver A buffer reduces passes, but it is easy to write an external merge sort that has no buffer and never requires more than `O(log(n))` files. The trick is to maintain a stack of files. Write 2 elements to a file, then recursively merge up the stack until the files are of different size. Keep doing that, then at the end recursively merge up the stack. – btilly Sep 05 '17 at 15:59
  • There is probably a distribution which allows selecting the numbers without needing to sort them afterward. – Jarod42 Sep 05 '17 at 15:59
  • @btilly I don't have to solve this today. This is one of the assignments in my exam on Friday, and even though I know which assignments will be there, I haven't been able to solve them for weeks, and I'll only have an hour in the exam. – Kris Bob Sep 05 '17 at 16:06
  • This may be the time to think about the distribution of differences between those `N` numbers after sorting. Next, assume a dummy element of `std::numeric_limits::max()`, draw a difference from a suitable source, subtract from the previous value, output, and repeat. (You need to adjust the source if you get to `std::numeric_limits::min()` too fast.) (This accomplishes the requirements of the first sentence without any relation to the `external sort` from the title.) – greybeard Sep 05 '17 at 16:07
  • @greybeard I'm not a native English speaker, so this is somewhat confusing to me; or maybe it doesn't have anything to do with my English. What do you mean by a suitable source from which to draw a difference? – Kris Bob Sep 05 '17 at 16:17
  • `What do you mean by suitable source` a source of random numbers with the same distribution as the differences between successive random ints drawn from a uniform source - after sorting. – greybeard Sep 05 '17 at 16:32
  • Take random number `K0` from `[min, max]`, then repeatedly take number `K1` from `[min, K0[`. If you take uniform distribution for `K0`, then you would have more low numbers than with the sort method. – Jarod42 Sep 05 '17 at 16:35
  • Are you allowed temporary files? e.g., could you write each individual number to a file named after that number (the number 1 goes in file 00001.dat, 2 goes in 00002.dat, etc., and it's OK to put duplicate values in files), and then make your output file by simply reading (and deleting) each of those temp files in order? – Max Lybbert Sep 05 '17 at 18:48

3 Answers


You can generate your numbers in sorted order in linear time. The paper describing how to do this is *Generating Sorted Lists of Random Numbers* by Bentley & Saxe:

https://pdfs.semanticscholar.org/2dbc/4e3f10b88832fcd5fb88d34b8fb0b0102000.pdf

/**
 * Generate a sorted list of random numbers, descending from 1 toward 0,
 * given the size of the list being requested.
 * 
 * This is an implementation of an algorithm developed by Bentley and Saxe,
 * published in ACM Transactions on Mathematical Software (v6, iss3, 1980) as
 * 'Generating Sorted Lists of Random Numbers'.
 */
public class SortedRandomDoubleGenerator {
    private long       valsFound;
    private double     curMax;
    private final long numVals;

    /**
     * Instantiate a generator of sorted random doubles.
     * 
     * @param numVals the size of the list of sorted random doubles to be
     *        generated
     */
    public SortedRandomDoubleGenerator(long numVals) {
        curMax = 1.0;
        valsFound = 0;
        this.numVals = numVals;
    }

    /**
     * @return the next random number, in descending order.
     */
    public double getNext() {
        curMax = curMax
                * Math.pow(Math.E, Math.log(RandomNumbers.nextDouble())
                        / (numVals - valsFound));
        valsFound++;
        return curMax;
    }
}
Dave
  • RandomNumbers can be the standard Java random number generator: any way of getting pseudorandom numbers between 0 and 1. – Dave Sep 05 '17 at 18:22
  • Cool, but I think the point of his assignment is to learn how to sort the contents of an on-disk file without loading the file into memory first; writing out the file in already-sorted-order would miss the point of the assignment? – Jeremy Friesner Sep 05 '17 at 18:22
  • @JeremyFriesner I didn't notice this was homework. Still, this answers the question asked so I'll leave it up. – Dave Sep 05 '17 at 18:25
  • @JeremyFriesner: I don't see anything in the question to indicate that it requires sorting an on-disk file. It says, "generates N random numbers and write them to binary file in descending order." Nothing says write them to file and then sort them. – Jim Mischel Sep 05 '17 at 19:39
  • @JimMischel you could be right, but OTOH the title of the question is "random numbers external sort", which implies to me that the point of the exercise is implementing an external sort (i.e. a sort of on-disk data). I think the questioner didn't describe the requirements very well in his description. – Jeremy Friesner Sep 05 '17 at 21:31
  • @JeremyFriesner I'm sorry if it's not described well; the assignment consists of input and output. Input: int N; Output: binary file numbers.dat with N random numbers in descending order. (Do not use any algorithms that use main memory.) – Kris Bob Sep 06 '17 at 10:49
  • Nice answer! One note: the referenced article proposes the use of `exp()` and `log()` as a workaround for programming languages that do not have a more direct way to apply a fractional exponent. But in the programming language you chose (Java) and the OP's language (C++) this is not an issue. You could just do `curMax * Math.pow(Random.nextDouble(), 1 / (numVals - valsFound))` avoiding the `log` call. – trincot Nov 28 '22 at 15:18

Here is pseudo-code for how I'd do it.

for i in 1..N:
    write rand() to new file
    push (new file, size=1) onto file stack
    while len(file stack) >= 2 and top two files have the same size:
        pop top two and merge them
        push (merged file, size = sum of sizes) onto file stack

while len(file stack) >= 2:
    pop top two and merge them
    push (merged file, size = sum of sizes) onto file stack

The top of the file stack is your new sorted file.
btilly
  • @greybeard Nope, it is `O(log(n))` files because the merge stops when the files are different sizes. Here are the stacks of sizes that you wind up with after the first few iterations: `1`, `2`, `2 1`, `4`, `4 1`, `4 2`, `8`, ... – btilly Sep 07 '17 at 20:40
  • When I say `4 1` I mean a file with 4 elements on the bottom and a file with 1 element on the top. And yes, the stack size does work out to `O(log(n))`. The worst case is that you need `n` files to store `2^n-1` elements. (Then they all collapse down to a single merged file.) As for my post, it was mostly meant to clarify a comment that I left above to @NathanOliver describing this exact strategy. – btilly Sep 07 '17 at 23:17
  • *Finally* looked up NathanOliver's and your comment. Hell-bent on external sorting I'd distribute the numbers/runs to *few* (say, m) files (two would do) and do an m-way merge, again trying to distribute to m output files. Mileages vary. – greybeard Sep 09 '17 at 01:41
  • @greybeard If you're going to do a *real* external sort, the unix `sort` utility is a good starting point. It can be beaten; for example, the exact m to use in an m-way merge is going to depend on your hardware. And, of course, you should do a certain amount of sorting in memory before writing anything to disk. But if I just wanted an easy-to-write-from-scratch external sort? This is how I'd write it. – btilly Sep 10 '17 at 03:50

The standard library has a merge sort, but you need random access iterators. If you can use mmap (or its equivalent), you have random access iterators (yes, I know that COUNT should really be taken from the command line):

#include <algorithm>
#include <cstdio>
#include <random>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

const size_t COUNT = 4096 * 4096;

int main()
{
    // create file (using mmap for simplicity)
    int fd = open("out.dat", O_RDWR | O_TRUNC | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd < 0) {
        std::perror("open failed");
        return 1;
    }
    if (ftruncate(fd, COUNT * sizeof(unsigned)) != 0) {
        std::perror("ftruncate failed");
        close(fd);
        return 1;
    }
    void* mm = mmap(nullptr, COUNT * sizeof(unsigned), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mm == MAP_FAILED) {
        std::perror("mmap failed");
        close(fd);
        return 1;
    }
    close(fd);

    // populate file
    unsigned* begin = static_cast<unsigned*>(mm);
    std::default_random_engine rng((std::random_device())());
    std::generate_n(begin, COUNT, rng);
    msync(mm, COUNT * sizeof(unsigned), MS_SYNC);
    std::puts("file written");

    // sort file in descending order, as the question asks
    auto desc = [](unsigned a, unsigned b) { return a > b; };
    std::stable_sort(begin, begin + COUNT, desc);
    msync(mm, COUNT * sizeof(unsigned), MS_SYNC);
    std::puts("file sorted");

    if (std::is_sorted(begin, begin + COUNT, desc)) {
        std::puts("it's properly sorted");
    }

    // close file
    munmap(mm, COUNT * sizeof(unsigned));
    return 0;
}

The msync calls aren't actually needed. I'm honestly surprised that this has decent performance.

Max Lybbert
  • `I'm honestly surprised` I wouldn't be. Kudos for another way to deal with the *abysmal problem statement*. (Mind how you think about a paying customer's requirement description, though.) – greybeard Sep 07 '17 at 20:00