3

First, a little context. I am an electrical engineering student minoring in compsci and I am pretty much entirely self taught with very little rigorous training in programming, so there is likely going to be stuff that isn't "standard" in my code below.

This little program is just a utility to generate an arbitrary size data file filled with randomly generated signed int values to be used as input for an assignment. I have completed the assignment and it works fine. This is a question about something strange (to me) that happened in this program that only started happening around the time that I added the section of code to handle checking for duplicates. Before, it just dumped the ints straight to a file, one per line. Then, I realized that wasn't strictly the way the professor's data would be formatted, so I changed it to prevent duplicates, add more than one int per line, and add additional whitespace delimiters (space, tab, newline).

Ok, with all of that said this works as long as I keep MAX_NUMBERS at about 32k or lower. If I make it higher, it displays the count nice and quickly until about 32k or so and then it slows way down for a couple of hundred or so and then abruptly hangs at 32768. Due to this number, I thought it might have to do with the size of an int (using codeblocks with the ming compiler), but sizeof(int) shows that it is 4 bytes, so that shouldn't cause it. Also thought maybe I was hitting a max limit on the number of indices on an array since before it wasn't using an array. My research indicates that this shouldn't be the cause though. I know it's going to slow way down as the number of values that it has to check for duplicates goes way up, but I'm confused why it just abruptly stops.

Lastly, I did try modifying it to use a larger C99 datatype instead of int, just as an experiment, but that did nothing.

If anyone happens to see anything dumb, besides using an array haha, please let me know! This is driving me a bit crazy.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>


int main()
{
    const int MAX_NUMBERS = 32000; // don't go higher than about 32000
    int* arr;
    // arr is used for duplicate checking, a log of everything put into the file is recorded
    // in arr and checked against to ensure uniqueness.

    const int ALLOW_NEG = 1; // switch to choose whether to allow negative numbers or not.

    int x = 0; // the random number that was generated
    int index = 0; // main loop control
    int index2 = 0; // dupe check loop control
    int hpos = 1; // used to select which type of whitespace to add
    int uniNum = 1; // uniqueness flag
    FILE *f = fopen("nums.txt", "w"); // open the file for writing. creates it if it's not there.

    arr = calloc(MAX_NUMBERS, sizeof(int)); // allocate space for the array

    for (index = 0; index < MAX_NUMBERS; index++) // arr init loop
        arr[index] = 999999999; // init the array to an invalid value. initially was 0, but caused 0 to be omitted by the dupe checker


    if (f == NULL){ // sanity check for the file
        printf("Error: Unable to open file. Program aborting.\n");
        exit(1);
    }

    printf("Generating data file...\n");

    srand(time(NULL)); // seed the random number generator

    fprintf(f, "%d\n", MAX_NUMBERS); // write the first line, the total number of ints in the file

    for (index = 0; index < MAX_NUMBERS; index++) { // main loop
        printf("\r%d", index); // just a display of the indices as the loops running, useless for small counts, semi-useful for very large amounts (100k+)
        do { // check for unique number
            uniNum = 1; // set uniqueness flag
            if (ALLOW_NEG == 1) { // executed if negatives are allowed
                  // This will allow 0, which makes sens if the
                  // range includes negative and positive.
                x = (rand() % MAX_NUMBERS+1) -((MAX_NUMBERS+1)/2); // generate a random number between (-max_nums/2) and (max_nums/2), totaling max_nums. the +1 is a bug fix, ask if curious
            } else { // no negs allowed!
                 // +1 makes the range from 1 to MAX_NUMBERS + 1,
                 // change to zero or remove to range from 0 to MAX_NUMBERS
                x = (rand() % MAX_NUMBERS+1) + 1; // generate random number of only positive ints and 0.
            }

            for (index2 = 0;index2 <= index; index2++){ // check currently generated numbers for dupes
                if (x == arr[index2]) { // dupe found!
                    uniNum = 0; // clear uniqueness flag
                    break; // end the for loop on a dupe, no sense in continuing
                }
            }
        } while(uniNum != 1); // repeat if the number wasn't unique
        arr[index] = x; // log the number

        if (hpos > 4) { // check to see if the horizontal position indicator is greater than 4
            fprintf(f, "%d\n", x); // write to the 5th position horizontally with a newline
            hpos = 1; // reset the horizontal position to the first. this gives me 5 numbers
                      //per line, with differing types of whitespace, just to test the reading
                      //and storing function. see a2.txt
        } else {
            switch (hpos) { // select based on which position we are in
                case 1 :
                    fprintf(f, "%d ", x); // first, space
                    hpos++;
                break;
                case 2 :
                    fprintf(f, "%d\t", x); // second, a tab character
                    hpos++;
                break;
                case 3 :
                    fprintf(f, "%d ", x); // third, another space
                    hpos++;
                break;
                case 4 :
                    fprintf(f, "%d\t", x);// fourth, another tab.. fifth is a newline
                    hpos++;
                break;
            }
        }
    }
    printf("\n%d numbers generated", index); // eh, print it out. why not?

    return 0;
}
  • 5
    You wouldn't by any chance be on Windows using an MS toolchain, would you ? Does it help if I mention RAND_MAX in MSVCRT is 0x7FFF (at least through VS2010; haven't checked recent implementations) ? – WhozCraig Jan 19 '15 at 22:11
  • 2
    @WhozCraig That of course is the answer, but OP may not have sufficient experience to understand the hint. @MichaelWilliams The first thing to do is confirm that `RAND_MAX` is 32767, which you can do by looking in `` or with a `printf("%d\n", RAND_MAX)`. – user3386109 Jan 19 '15 at 22:17
  • from orginal post, "using codeblocks with the ming compiler" – hhafez Jan 19 '15 at 22:48
  • Thanks guys, I checked stdlib.h and sure enough, that was it. I appreciate the quick answer. – Michael Williams Jan 20 '15 at 13:46

1 Answers1

4

The logic in your do loop is:

  • Select a random number
  • Go through the list of previously-accepted numbers and see if it is there
    • If so, go back and pick another random number
    • If not, exit this loop

As suggested in comments, probably your system has RAND_MAX == 32767 so there are only 32768 possible random numbers. So once you have selected every one, then this loop becomes an infinite loop.

The reason it appears to slow down near the end is that it will go a lot of iterations of the do loop (which don't display anything) in between times that it does find a new number.

If you move the printf("\r%d", index); to be inside the do loop (and have a changing display per iteration) you should see this.


To get a bigger range of random numbers, I would use a free implementation of the Mersenne Twister (mt19937), or see here for other options.

Also, your algorithm to generate a list of unique random numbers is extremely inefficient (too much searching of the existing list), see here for an improvement.

Community
  • 1
  • 1
M.M
  • 138,810
  • 21
  • 208
  • 365
  • Explains the sudden slow down, but not the speed up, still +1 for at least solving part of the mystery. – hhafez Jan 19 '15 at 22:47
  • @hhafez What speedup? Your question doesn't mention that. – M.M Jan 19 '15 at 22:50
  • 3
    If you're using most or all of the numbers, Floyd's algorithm won't work well either. Iti's best to just fill the array with all the numbers you need, then shuffle them. – Lee Daniel Crocker Jan 19 '15 at 22:52
  • @LeeDanielCrocker it seems from the question that he might want, say, 100K or 1 million numbers but out of the sample space of all 32bit numbers – M.M Jan 19 '15 at 22:55
  • Ah, if that's the case, and he needs unique numbers, then a variation of Floyd might work. Of course, he'd need a real PRNG as well. – Lee Daniel Crocker Jan 19 '15 at 22:58
  • I wasn't really looking for a solution, as 32k numbers is sufficient for my needs here, more than, really. I was mostly looking for an explanation, which was given above as RAND_MAX. I could have been more clear in my question though, and I will definitely make a note of the above solutions in case I ever need them, thanks. – Michael Williams Jan 20 '15 at 13:49