16

What's the best way to return a random line in a text file using C? It has to use the standard I/O library (<stdio.h>) because it's for Nintendo DS homebrew.

Clarifications:

  • Using a header in the file to store the number of lines won't work for what I want to do.
  • I want it to be as random as possible (ideally, each line has the same probability of being chosen as every other line).
  • The file will never change while the program is being run. (It's the DS, so no multi-tasking.)
Paige Ruten
  • Do you have any control over the file? Can you preprocess the file on your own machine to count the number of lines, and add a header at the top of the file that indicates how many complete lines there are? – cfeduke Oct 24 '08 at 02:08
  • I want the file to be easily editable in a text editor, so updating a header each time the file is edited would take too long. – Paige Ruten Oct 24 '08 at 02:11

8 Answers

31

Read each line, and use a random number to choose whether to keep that line or ignore it. Keep the first line with probability 1/1; keep the second with probability 1/2; keep the third with probability 1/3, and so on.

count = 0;
while (fgets(line, length, stream) != NULL)
{
    count++;
    if ((rand() * count) / RAND_MAX == 0)
        strcpy(keptline, line);
}

I haven't verified that this has the proper random qualities, but it seems right at first glance.


It has been pointed out that integer overflow would quickly become a problem with the way the comparison is coded, and I had independently reached the same conclusion myself. There are probably many ways to fix it, but this is the first that comes to mind:
if ((rand() / (float)RAND_MAX) <= (1.0 / count)) 
Mark Ransom
  • I like the use of direct probability over choosing randomly from a list of lines. – Bernard Oct 24 '08 at 02:03
  • I upvoted the answer because I think in theory it's right. However, integer overflow does cause the end result to be skewed. – C. K. Young Oct 24 '08 at 02:12
  • 4
    1st line: 50% chance. 2nd line: (100-50%)*(1/3)=16.7%, 3rd line: (100-50-16.7%)*(1/4)=8.3%, ... This is a valid distribution, but it isn't all uniform (and the OP didn't say it had to be uniform). – Mr Fooz Oct 24 '08 at 02:18
  • just be careful that rand() * count doesn't cause an overflow for larger files (MAX_INT / RAND_MAX is the limit on the number of lines). – Matthew Smith Oct 24 '08 at 02:19
  • 3
    actually with this method, you will get the first line of the file half the time! – Matthew Smith Oct 24 '08 at 02:26
  • There's surely something that I don't understand, If you have 1000 lines, the last line has a very low probability of being kept, but the first line still has 50%. What am I missing? – Martin Cote Oct 24 '08 at 02:27
  • 19
    What the heck is everybody talking about? This solution **does** pick every line with equal probability, regardless of how many lines there are. First line has a 1/1 chance of being picked, but an (n-1)/n chance of being replaced later on. – ephemient Oct 24 '08 at 02:36
  • 1
    tvan and Martin Cote - you are both wrong. See e.g., here for a detailed explanation of the math behind this answer http://www.unix.com.ua/orelly/perl/cookbook/ch08_07.htm – 1800 INFORMATION Oct 24 '08 at 02:37
  • 2
    tvanfosson - perhaps your misunderstanding is that it doesn't stop when it has "picked" a line - subsequent lines can replace it (with diminishing probability) – 1800 INFORMATION Oct 24 '08 at 02:39
  • Oh I see now, for some reason when I was reading and re-reading the example it didn't occur to me that he didn't exit the loop once he had selected a match (nevermind there was no code there to even do that, I guess I was just seeing what wasn't there). – cfeduke Oct 24 '08 at 02:40
  • Says "Aha" and slaps head. Sorry. – tvanfosson Oct 24 '08 at 02:42
  • 4
    Now that I see how it works, what about using rand() % count == 0 as the test? – tvanfosson Oct 24 '08 at 03:20
  • 1
    This is clever because the number of lines can remain unknown until the end of the file, when you then have your answer. But, if one could know the number of lines, it would be preferred to pick a number, and fetch that. Alternative... I can't give in 64 characters! -frown- BIGGER COMMENTS, PLEASE! – Richard T Oct 24 '08 at 03:35
  • Even though clever, if the file does not change, this becomes a worse solution than Brian's. –  Jun 09 '10 at 21:40
  • @Moron, Brian's solution is only better if you have somewhere to store the number of lines and the starting position of each line, and if you need to select more than one line over the life of that storage. I don't think either condition can be taken as a given. – Mark Ransom Jun 09 '10 at 22:04
9

Mark's answer is almost correct except for two issues:

  1. If a line is longer than length - 1 characters (including the newline), then the while loop will increment count at least twice for the same line: once for the first length - 1 characters, another for the next length - 1 characters, etc.
  2. The calculation of rand() * count can cause an integer overflow.

To solve the first problem, you can call fgets into a trash buffer until it returns NULL (indicating an I/O error or EOF with no data read) or the trash buffer contains a newline:

count = 0;
while (fgets(line, length, stream) != NULL)
{
    char *p = strchr(line, '\n');
    if (p != NULL) {
        assert(*p == '\n');
        *p = '\0'; // trim the newline
    }
    else { // haven't reached EOL yet. Read & discard the rest of the line.
#define TRASH_LENGTH 1024
        char trash[TRASH_LENGTH];
        while((p = fgets(trash, TRASH_LENGTH, stream)) != NULL) {
            if ((p = strchr(trash, '\n')) != NULL) // reached EOL
                break;
        }
    }
    assert(strchr(line, '\n') == NULL); // `line` does not contain a newline
    count++;
    // ...

The second problem can be solved with @tvanfosson's suggestion if floating-point arithmetic is not available:

int one_chance_in(size_t n)
{
    if (rand() % n == 0) // `rand` returns an integer in [0, `RAND_MAX`]
        return 1;
    else
        return 0;
}

But note that rand() % n is not a uniform, discrete random variable even if rand() is assumed to be one, because the probability that rand() % n == 0 can be as much as 1/RAND_MAX higher than the desired probability 1/n. On my machine, RAND_MAX is 2147483647, so the difference is 4.66 × 10⁻¹⁰, but the C standard only requires that RAND_MAX be at least 32767 (a 3.05 × 10⁻⁵ difference).
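If that bias matters, the usual fix is rejection sampling: discard the top sliver of rand()'s range so the remaining values divide evenly into n buckets. A sketch (the function name is mine):

```c
#include <stdlib.h>

/* Returns 1 with probability exactly 1/n (assuming rand() itself is uniform). */
int one_chance_in_unbiased(size_t n)
{
    /* Largest multiple of n that fits in rand()'s range [0, RAND_MAX]. */
    unsigned long limit = ((unsigned long)RAND_MAX + 1UL) / n * n;
    unsigned long r;

    do {
        r = (unsigned long)rand();
    } while (r >= limit);   /* reject the uneven remainder */

    return r % n == 0;
}
```

Below `limit`, exactly `limit / n` values satisfy `r % n == 0`, so the conditional probability is exactly 1/n; the expected number of rejected draws is well under one per call.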

Also, for anyone left wondering why this scheme works (as I was), it might be helpful to work through the calculation of the probability that the first line remains in keptline if there are m lines and generalize: In the first iteration of the loop, the probability that the first line is copied to keptline is 1/1. In the second iteration of the loop, the probability that the second line does not overwrite the first line is 1/2. In the third iteration, the probability that the third line does not overwrite the first line is 2/3. Continuing, the probability that the last line does not overwrite the first line is (m - 1)/m. Thus, the probability that the first line remains in keptline after iterating over all lines is:

1/1 × 1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the second line remains in keptline is:

1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the third line remains in keptline is:

1/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

Etc. They're all 1/m.

Daniel Trebbien
6

This method is good because:

i) You can keep generating random lines at no big cost

ii) You only have to read the file in full once, plus one line per random line you want afterwards. The excess data read is at most the size of the file.

iii) It gives each line a fair chance no matter what its position is in the file.

iv) It gives each line a fair chance no matter what its length is in the file.

The suggestion:

I would suggest a 2 pass algorithm. Well really it's a 1 pass + N lines. Where N is the number of random lines you want.

The first pass you would use to calculate how many lines and the start positions of each line.

You then take a random number from 0 to the number of lines minus 1. That random number is your line index; look up the start position recorded for that index and seek to that position.

You then need only one more read, and you know the exact size of the line (the distance to the start position of the next line).

How to store the number of lines and the index of each line:

To store the number of lines, you can obviously just use an int.

If you can use a vector then you can add each line index into the vector. If not you can just create an array of ints with the max number of lines you think there will be. Then index into that array.

Other answers:

Another answer mentioned that you can pick a random number from 1 to the size of the file, and then use the closest newline. But this won't work. For example you might have 1 line that is really long and the others that are not that long. In that case you would have an uneven distribution.

Brian R. Bondy
  • His solution is not good if you have to keep generating random lines. A lot of extra reading that you won't need. – Brian R. Bondy Oct 24 '08 at 02:12
  • It took a couple of tries but you finally came up with a decent answer so I reversed my downvote. I have solved this same problem before using both your method and Mark's, they are both fine solutions, which is best depends on the situation. – Robert Gamble Oct 24 '08 at 02:18
  • If you want a uniform distribution (even odds of picking any line), you need to do something like this. Mark Random's is heavily biased to the first lines in the file (which may or may not be fine for the application). – Mr Fooz Oct 24 '08 at 02:20
  • Mark Random? I don't think anybody's called me that before, but in this case it fits... – Mark Ransom Oct 24 '08 at 02:22
  • Your solution is not good if the content of the file is potentially changing - it's a tradeoff – 1800 INFORMATION Oct 24 '08 at 02:24
  • Mark's and Adam's solutions each "work": they do sample from a random distribution. It just isn't a uniform distribution (which is probably what's desired). If all lines are of similar length, Adam's could be a good approximation. – Mr Fooz Oct 24 '08 at 02:26
  • 1
    @Mr Fooz, Mark's solution (if I interpreted it correctly) is a well-known algorithm for this very problem and appears in TAoCP. At first it might appear to be biased but when you work out the math you can see that this is not the case. – Robert Gamble Oct 24 '08 at 02:26
  • 1800 I guess you could make a copy if you knew it would be changing, but still needed a fair random distribution. Or you could use some kind of file locking. – Brian R. Bondy Oct 24 '08 at 02:26
  • +1: Based on the clarification that the file does not change. This is a _much much_ better solution than the accepted answer. Of course, this assumes that nintendo DS supports seeking in a file. If it does not, then this is a _much_ better solution than the accepted solution. –  Jun 09 '10 at 21:37
  • Ya a lot of people don't consider that 2 pass is not the end of the world considering sometimes people use O(n^2). O(2n) = O(n). That's a great point about if the file changes. Maybe some locking mechanism could be used or else a copy of the file. – Brian R. Bondy Jun 10 '10 at 00:49
3
  1. Get the length of the file.
  2. Pick a random position in the file.
  3. Seek to that position.
  4. Iterate forward until you find a newline character.
  5. If you don't find a newline character, seek back to the beginning.
  6. Use fgets() to read the line.
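A sketch of these steps (with fgets rather than the unbounded gets; the function name is mine, and note the comments below about the bias toward lines that follow long lines):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Steps 1-6 above; biased toward lines preceded by long lines. */
int random_offset_line(FILE *f, char *out, size_t size)
{
    long len, pos;
    int c;

    fseek(f, 0, SEEK_END);          /* 1. get the length of the file */
    len = ftell(f);
    if (len <= 0)
        return 0;

    pos = rand() % len;             /* 2. pick a random position */
    fseek(f, pos, SEEK_SET);        /* 3. seek to that position */

    while ((c = fgetc(f)) != EOF && c != '\n')
        ;                           /* 4. scan forward to the next newline */

    if (c == EOF || fgets(out, (int)size, f) == NULL) {
        rewind(f);                  /* 5. fell off the end: wrap around */
        return fgets(out, (int)size, f) != NULL;
    }
    return 1;                       /* 6. out now holds the chosen line */
}
```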
Adam Pierce
  • 2
    There is a tradeoff here because your method will favour lines that are longer over shorter lines. Mark Ransom's solution doesn't have this problem, but it needs to read the entire file (but not all at once). – 1800 INFORMATION Oct 24 '08 at 02:05
  • Nice answer but it's not a good solution because you won't know the size of each line. – Brian R. Bondy Oct 24 '08 at 02:05
  • This is a good choice for operating within the limited resources of the DS (or any such device). – cfeduke Oct 24 '08 at 02:05
  • Nice. Why not just seek backwards though? – Draemon Oct 24 '08 at 02:26
  • 1
    @Draemon - seeking backwards won't help, consider a file where line 1 is 98 bytes and lines 2 and 3 are both 1 byte. You have a 98% chance of landing in line 1, no matter what you do from there you will have a 98% chance of hitting the same line every time you run the process. – Robert Gamble Oct 24 '08 at 03:06
  • It's sloppy but it's fast and that may be a reasonable tradeoff depending on the kind of data the OP is working with (eg. larger files with not too much variation in line length). – Adam Pierce Oct 24 '08 at 03:15
1

I have an alternative solution. Since the platform is the DS, you'd probably not want to try to hold the file in memory. This reads the file twice. Once to count the lines and the 2nd time to find the line it wants. This would run slower than the other solutions suggested so far but it uses hardly any memory. I even wrote it in C for you (I omitted error handling):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    FILE *f;
    int nLines = 0;
    char line[1024];
    int randLine;
    int i;

    srand(time(0));
    f = fopen(argv[1], "r");

/* 1st pass - count the lines. */
    while(fgets(line, 1024, f) != NULL)
        nLines++;

    randLine = rand() % nLines;
    printf("Chose %d of %d lines\n", randLine, nLines);

/* 2nd pass - find the line we want. */
    fseek(f, 0, SEEK_SET);
    for(i = 0; i <= randLine; i++)
        fgets(line, 1024, f);

    printf("%s", line);
    fclose(f);
    return 0;
}

UPDATE: Oops, I should have read Brian R. Bondy's answer before I posted this but I was kinda obsessing over writing the code and didn't notice. This is almost the same except it doesn't store the line positions in an array. You could do it either way depending on how big the file is and whether speed is more important than saving memory.

Adam Pierce
0

All you need do is generate one unscaled random number per line, while maintaining the maximum value for all the random numbers you generate. Whenever you update the maximum value, you overwrite the selected line with the current line.

At the end you get the line associated with the highest number rand() spat out, which should be equally probable among all your lines.
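A sketch of this idea (the names are mine; note that rand() can produce the same value twice, and ties slightly favor earlier lines here):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 256

/* Tag each line with rand(); the line holding the largest tag wins. */
int pick_by_max_tag(FILE *f, char *out, size_t size)
{
    char line[MAX_LINE];
    int best = -1;   /* below any possible rand() value */
    int got = 0;

    while (fgets(line, sizeof line, f) != NULL) {
        int tag = rand();
        if (tag > best) {           /* new maximum: this line is selected */
            best = tag;
            strncpy(out, line, size - 1);
            out[size - 1] = '\0';
            got = 1;
        }
    }
    return got;
}
```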

paperhorse
0

Just a quick note about Mark Ransom's way of avoiding integer overflow: the DS has no FPU, so floating point division will be emulated in software and very slow. You'll want to avoid typecasting/promotion to float or double at all costs, if speed is a concern.

Here's a different way of avoiding integer overflow that avoids any floating point math:

if(rand() <= RAND_MAX / count)

The probabilities may be slightly skewed due to integer division, but this should certainly run much faster on a DS.

chazomaticus
-1

Use a combination of Adam's random-offset approach and Mark's probability approach. Adam's method gets you to a random section of the file; you then apply Mark's approach from there to avoid favoring the longer lines. Mark's algorithm will prefer the first few lines from wherever it starts.

Matthew Smith
  • Adam's algorithm is fatally flawed, no matter what you try to do to accommodate for this you cannot overcome the fact that the initial position chosen is going to be skewed based on line length which destroys any possibility of a uniform distribution. – Robert Gamble Oct 24 '08 at 03:03
  • Yeah, I misunderstood Mark's algorithm as having a bias and this was supposed to be a cheap way of getting a uniform-ish distribution – Matthew Smith Oct 27 '08 at 01:55