FILE* inp;
inp = fopen("wordlist.txt","r");        //filename of your data file
char arr[100][5];           //max word length 5
int i = 0;
while(1){
    char r = (char)fgetc(inp);
    int k = 0;
    while(r!=',' && !feof(inp)){    //read till , or EOF
        arr[i][k++] = r;            //store in array
        r = (char)fgetc(inp);
    }
    arr[i][k]=0;        //make last character of string null
    if(feof(inp)){      //check again for EOF
        break;
    }
    i++;
}

I am reading words from a file and storing them in an array. My question is: how can I randomly select 7 of these words and store them in an array?

The input file has the following content:

https://ibb.co/LkSJ1SV

meal
cheek
lady
debt
lab
math
basis
beer
bird
thing
mall
exam
user
news
poet
scene
truth
tea
way
tooth
cell
oven
Jonathan Leffler
user7777
  • What if there are more than 100 words OR if any of the words have more than 4 characters? – जलजनक Apr 28 '22 at 16:10
  • Read about `fgetc` return value. It is not char, and that has great significance. – hyde Apr 28 '22 at 16:12
  • I know that there are 50 words in the file and that the maximum length is 5 characters. – user7777 Apr 28 '22 at 16:15
  • I strongly suggest that you read this: [Why is “while ( !feof (file) )” always wrong?](https://stackoverflow.com/q/5431941/12149471) – Andreas Wenzel Apr 28 '22 at 16:19
  • https://ibb.co/LkSJ1SV separated by newline – user7777 Apr 28 '22 at 16:22
  • I read the link that you posted, but I am a beginner at C, so I could not understand where the problem is. Can you help me fix this? – user7777 Apr 28 '22 at 16:50
  • Your previous comment seems to have been intended for me. Note that I will not be notified of your comment, unless you write my name using the `@` syntax. Press the `Help` button while writing a comment for further information. If you do not notify me, then I may not notice your comment. Only the owner of the post to which the comment is attached (which is you) will automatically be notified of comments to that post. Therefore, you should generally use the `@` syntax when replying to posts, when the comment is attached to your question. – Andreas Wenzel Apr 28 '22 at 16:53
  • At the moment, your question title "picking random string from array" is inaccurate. That may be where you're headed, but at the moment, you're dealing with a different problem — reading words from a file into an array. Limiting words to 4 characters is nonsense (most of the words in this comment won't fit into 5 characters). You must ensure you don't read more than 100 words. You must make sure that none of the words are more than 4 characters long. It's not clear that spaces, tabs, newlines are parts of words, but you seem to treat them as such. – Jonathan Leffler Apr 28 '22 at 16:57
  • Actually, after looking at your code more closely, I believe that your use of `feof` is not wrong in this case. The only problem is that it only tests for end-of-file and not for input error, but as a beginner, you don't have to worry about that for now. I have therefore withdrawn and deleted my comment in which I stated that you are using `feof` wrongly. – Andreas Wenzel Apr 28 '22 at 17:02
  • If the words are separated by newline characters and not commas, why are you searching for a `','` in the input stream? – Andreas Wenzel Apr 28 '22 at 17:08
  • Side note: It would probably be easier to use the function [`fgets`](https://en.cppreference.com/w/c/io/fgets) to read one line at a time, instead of reading the file one character at a time. However, your use of the function [`fgetc`](https://en.cppreference.com/w/c/io/fgetc) is, as far as I can tell, not wrong, except for the fact that the array only has space for 4 characters plus the terminating null character, but some words are 5 characters long. – Andreas Wenzel Apr 28 '22 at 17:13
  • Given the actual text from the link, you have zero commas in the file (and newlines separating the words) and the names are up to 5 characters long, so in C you need an array at least as big as `char word[6]` for each word (but there are many words longer than 5 characters in general inputs). If the input really is comma-separated on a single line, then your link is misleading. And it demonstrates the importance of showing the input data (as well as the expected output). That's part of creating a Minimal, Complete, Verifiable Example (a [MCVE]) — see also [SSCCE](http://sscce.org/). – Jonathan Leffler Apr 28 '22 at 18:00
  • What you're seeking is "Reservoir Sampling". From Wikipedia on [Reservoir Sampling](https://en.wikipedia.org/wiki/Reservoir_sampling): _**Algorithm R** The most common example was labelled Algorithm R by Jeffrey Vitter … This simple O(n) algorithm as described in the Dictionary of Algorithms and Data Structures consists of the following steps …_ (code omitted). Also [What's the best way to return a random line in a text file using C?](https://stackoverflow.com/q/232237/15168). And Perl's [perlfaq5](https://perldoc.perl.org/perlfaq5) has an item that cites Knuth's TAOCP Vol 2 §3.4.2. – Jonathan Leffler Apr 28 '22 at 18:17
  • Does this answer your question? [C - Get random words from text a file](https://stackoverflow.com/questions/43214157/c-get-random-words-from-text-a-file) – ioannis-kokkalis Apr 28 '22 at 21:42
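
Several of the comments above suggest reading one line at a time with `fgets` and making sure that neither the word count nor the word length overflows the array. The following is only a minimal sketch of that idea, not code from the thread: the helper name `read_words`, the constant `WORD_SIZE` and the 256-byte line buffer are assumptions made for illustration, and words that do not fit are simply skipped. Input lines longer than the line buffer are not handled specially.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WORD_SIZE 6   /* assumed: up to 5 characters plus the terminating null */

/* Hypothetical helper: reads up to max_words newline-separated words from fp
   into words and returns the number of words stored. */
size_t read_words(FILE *fp, char words[][WORD_SIZE], size_t max_words)
{
    char line[256];   /* assumed upper bound on the length of one input line */
    size_t count = 0;

    assert(fp != NULL);

    while (count < max_words && fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;                         /* skip empty lines */
        if (strlen(line) >= WORD_SIZE)
            continue;                         /* skip words that do not fit */
        strcpy(words[count++], line);
    }
    return count;
}
```

With the array from the question, a call could look like `size_t n = read_words(inp, arr, 100);`, provided that the second dimension of `arr` equals `WORD_SIZE`.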

2 Answers


Important

This solution has a weak random distribution when picking words, but it handles any number of words in the input file. This question has been asked to look more closely into the issue.

The Idea

  • To keep things simple, each word is limited to 256 chars (including \0), which is safe for small examples.
  • The array has room only for the number of words you want to end up with. There is no need to store all the words and then pick 7: you can pick 7 words while reading the file by overwriting previously stored words at random.
  • The first while loop fills the array so that there are no empty cells.
  • The second while loop overwrites previously filled cells at random.

Solution

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TOTAL_WORDS 7
#define CHARS_PER_WORD 256

void readWord(FILE* inp, char [TOTAL_WORDS][CHARS_PER_WORD], int i);

int main(int argc, char const *argv[]) {
    srand(time(NULL));

    char arr[TOTAL_WORDS][CHARS_PER_WORD] = { 0 };

    FILE* inp;
    inp = fopen("wordlist.txt","r");
    // make sure file opening did not fail
    if( inp == NULL ) {
        fprintf(stderr, "Could not open file.\n");
        return EXIT_FAILURE;
    }

    int i = 0;

    while( i < TOTAL_WORDS && !feof(inp) )
        readWord(inp,arr,i++);

    while( !feof(inp) ) {
        if( (rand()%  2) == 1 )
            readWord(inp,arr,rand() % TOTAL_WORDS);
        else // consume the word without saving it
            while( fgetc(inp)!='\n' && !feof(inp) ) { } 
    }

    for( int i = 0; i<TOTAL_WORDS; i++ ) 
        printf("%d: %s\n", i, arr[i]);

    return 0;
}

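void readWord(FILE* inp, char arr[TOTAL_WORDS][CHARS_PER_WORD], int i) {
    int k = 0;
    int r = fgetc(inp);   // fgetc returns int so that EOF stays distinct from every char
    while( r != '\n' && r != EOF ){
        if( k < CHARS_PER_WORD - 1 )   // ignore characters that would overflow the buffer
            arr[i][k++] = (char) r;
        r = fgetc(inp);
    }
    arr[i][k]='\0';  
}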

With input file wordlist.txt containing:

meal
cheek
lady
debt
lab
math
basis
beer
bird
thing
mall
exam
user
news
poet
scene
truth
tea
way
tooth
cell
oven

One of the results was:

0: scene
1: truth
2: tooth
3: oven
4: way
5: user
6: cell

Additions/Changes Explained

The C library headers that declare the functions srand(), rand() and time(), which are used for the randomization.

#include <stdlib.h>
#include <time.h>

Define how many words we want to keep at the end. This is useful if we want to change from 7 to something else: instead of using 7 all over the place, we use TOTAL_WORDS whenever we refer to the number of words to keep. In the same way, define the maximum number of chars per word.

#define TOTAL_WORDS 7
#define CHARS_PER_WORD 256

Initialize the seed for the function rand(). You can read more about it here.

srand(time(NULL));

Get a number in range of our array size. You can read more about rand() function here.

rand() % TOTAL_WORDS

To avoid repeating the same code in both while loops, the part that reads a word from the file is encapsulated in a function. That makes the main code a lot simpler to read and maintain.

void readWord(FILE* inp, char arr[TOTAL_WORDS][CHARS_PER_WORD], int i) {
    ...  
}

Print words saved.

for( int i = 0; i<TOTAL_WORDS; i++ ) 
    printf("%d: %s\n", i, arr[i]);



Slightly Improved Version

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define INPUT_FILE "wordlist.txt"
#define TOTAL_WORDS 7
#define CHARS_PER_WORD 256
#define DELIMETER '\n'

FILE* input;

char word[CHARS_PER_WORD];
char words[TOTAL_WORDS][CHARS_PER_WORD];

void openFile();    
void readWord();
void saveWord(int position);
void pickWords();
void printWords();
int hasWordAt(int position);
int isFull();

int main(int argc, char const *argv[]) {
    srand(time(NULL));

    openFile();

    pickWords();

    printWords();

    return 0;
}

void printWords() {
    for( int i = 0; i<TOTAL_WORDS; i++ ) 
        printf("%d: %s\n", i, words[i]);
}

void pickWords() {
    int pos;
    while( !feof(input) && !isFull() ) {
        readWord();
        do {
            pos = rand() % TOTAL_WORDS;
        } while( hasWordAt(pos) );
        saveWord(pos);
    }
    while( !feof(input) ) {
        readWord();
        if( (rand() % 2) == 0 )
            continue;
        pos = rand() % TOTAL_WORDS;
        saveWord(pos);
    }
}

int hasWordAt(int position) {
    return words[position][0] != '\0';
}

int isFull() {
    for( int i = 0; i<TOTAL_WORDS; i++ ) 
        if( words[i][0] == '\0' )
            return 0;
    return 1;
}

void saveWord(int position) {
    strcpy(words[position],word);
}

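void readWord() {
    int i = 0;
    int ch = fgetc(input);   // fgetc returns int so that EOF stays distinct from every char
    while( ch != DELIMETER && ch != EOF ){
        if( i < CHARS_PER_WORD - 1 )   // ignore characters that would overflow the buffer
            word[i++] = (char) ch;
        ch = fgetc(input);
    }
    word[i]='\0';  
}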

void openFile() {
    input = fopen(INPUT_FILE,"r");
    if( input == NULL ) {
        fprintf(stderr, "Couldn't open file.\n");
        exit(EXIT_FAILURE);
    }
}
ioannis-kokkalis
  • The random distribution (i.e. the randomness) of the result is not very good, using this algorithm. The words near the end of the list have a much higher probability of being chosen in the result. The last word in the list is even guaranteed to be chosen (only its position in the result is random). If you want to prevent the same random word from being chosen twice, then you may want to read about the [Fisher-Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle). – Andreas Wenzel Apr 28 '22 at 18:56
  • @AndreasWenzel You are right, but I am not sure about the last part. First, you suggest doing the permutations for a better distribution, and second, this won't prevent the last words from being included each time. – ioannis-kokkalis Apr 28 '22 at 19:07
  • I meant doing a Fisher-Yates shuffle on all words, not only on the 7 selected words. – Andreas Wenzel Apr 28 '22 at 19:08
  • Ohh yes that makes sense, but in my solution i am not keeping all the words so it won't help. – ioannis-kokkalis Apr 28 '22 at 19:09
  • I suggest that you read [this question](https://stackoverflow.com/q/232237/15168), which has an answer which uses an algorithm similar to yours, but has a better distribution. This link was originally provided by someone else in the comments section of the current question. – Andreas Wenzel Apr 28 '22 at 20:17
  • @AndreasWenzel thanks! That's what I was looking at now. Decreasing the probability of keeping the new word for each successive word does the trick. Question: should the first 7 words always be stored in random positions, or do they follow the same probability logic? – ioannis-kokkalis Apr 28 '22 at 20:25
  • Actually, I am not sure if the algorithm described in the other question can be adapted to select several words, while still maintaining uniform distribution. – Andreas Wenzel Apr 28 '22 at 20:33
  • @AndreasWenzel it seems the solution may be given by [Reservoir Sampling](https://en.wikipedia.org/wiki/Reservoir_sampling). I found something similar to this problem in this [question](https://stackoverflow.com/questions/43214157/c-get-random-words-from-text-a-file), and I am checking whether it works. – ioannis-kokkalis Apr 28 '22 at 21:24
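
The comments above point to reservoir sampling as a way to keep 7 words without storing the whole file, while giving every word the same chance of being kept. The following is a minimal sketch of Algorithm R as described on the linked Wikipedia page, not code posted in this thread. The names `reservoir_sample`, `SAMPLE_SIZE` and `WORD_SIZE` are assumptions made for illustration. It also assumes one word per line, fewer lines than RAND_MAX, and it accepts the slight bias of `rand() % n`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SAMPLE_SIZE 7     /* number of words to keep */
#define WORD_SIZE   256   /* assumed maximum word length, including the null */

/* Hypothetical helper: keeps a random sample of up to SAMPLE_SIZE lines of fp
   in sample and returns the number of words stored. */
size_t reservoir_sample(FILE *fp, char sample[SAMPLE_SIZE][WORD_SIZE])
{
    char line[WORD_SIZE];
    size_t seen = 0;   /* number of words read so far */

    assert(fp != NULL);

    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;                         /* skip empty lines */

        if (seen < SAMPLE_SIZE) {
            /* The first SAMPLE_SIZE words fill the reservoir directly. */
            strcpy(sample[seen], line);
        } else {
            /* Word number seen + 1 replaces a random slot with
               probability SAMPLE_SIZE / (seen + 1). */
            size_t j = (size_t)rand() % (seen + 1);
            if (j < SAMPLE_SIZE)
                strcpy(sample[j], line);
        }
        seen++;
    }
    return seen < SAMPLE_SIZE ? seen : SAMPLE_SIZE;
}
```

Unlike the fixed 50% replacement chance used in the answer above, the chance of keeping a new word shrinks as more words are read, so the last words of the file are no longer favored. The caller is still expected to call `srand` once before using it.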

First of all, your program has the following issues:

  1. In your posted input, some of the words are 5 characters long, but your array only has room for 4 characters plus the terminating null character.
  2. The words in your posted input are separated by newline characters, not commas. Therefore, it does not make sense that you are searching the input stream for ',' instead.

After fixing these two issues in your code and adding a function main and all necessary headers, it should look like this:

#include <stdio.h>

int main( void )
{
    FILE* inp;
    inp = fopen("wordlist.txt","r");        //filename of your data file
    char arr[100][6];           //max word length 5
    int i = 0;
    while(1) {
        char r = (char)fgetc(inp);
        int k = 0;
        while(r!='\n' && !feof(inp)) {   //read till newline or EOF
            arr[i][k++] = r;            //store in array
            r = (char)fgetc(inp);
        }
        arr[i][k]=0;        //make last character of string null
        if(feof(inp)){      //check again for EOF
            break;
        }
        i++;
    }
}

In C, it is common to use the function rand to generate a random number between 0 and RAND_MAX. The macro constant RAND_MAX is guaranteed to be at least 32767.

In order to get a random number between 0 and i (not including i itself), you can use the following expression, which uses the modulo operator:

rand() % i

This will not give you an even distribution of random numbers, but it is sufficient for most common purposes.
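
If the slight unevenness ever matters, one common remedy is rejection sampling: discard the few values of rand() at the top of its range that would make some remainders more likely than others. The following is only a sketch under that assumption; the helper name `uniform_index` is made up for illustration and is not part of the standard library.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns a uniformly distributed integer in [0, n),
   for 0 < n <= RAND_MAX. */
int uniform_index(int n)
{
    assert(n > 0 && n <= RAND_MAX);

    /* rand() can return RAND_MAX + 1 different values. */
    unsigned int range = (unsigned int)RAND_MAX + 1u;

    /* Largest multiple of n that does not exceed range. */
    unsigned int limit = range - range % (unsigned int)n;

    unsigned int r;
    do {
        r = (unsigned int)rand();   /* retry values that would cause bias */
    } while (r >= limit);

    return (int)(r % (unsigned int)n);
}
```

For a word list as small as the one in the question, the bias of plain `rand() % i` is tiny, so the rest of this answer keeps using the simpler expression.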

Therefore, in order to select and print a random word, you can use the following statement:

printf( "%s\n", arr[rand() % i] );

If you want to select and print 7 random words, then you can run this statement in a loop 7 times. However, it is possible that the same word will be selected randomly several times. If you want to prevent this from happening, then you will have to use a more complex algorithm, such as a Fisher-Yates shuffle.

However, this will print the same random sequence of words every time you run your program. If you want the random number generator to generate a different sequence of random numbers every time the program is run, then you must seed the random number generator, by calling the function srand with some random data.

The simplest source of randomness is the current time. The function time will return an integer representing the current time, usually in seconds.

srand( (unsigned)time(NULL) );

However, since the function time usually uses seconds, this means that if you run the program twice in the same second, the random number generator will be seeded with the same value, so it will generate the same sequence of random numbers. If this is an issue, then you may want to find some other source of randomness.
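
One possible source with finer resolution is the C11 function `timespec_get`, which also reports nanoseconds. The following sketch mixes the seconds and nanoseconds into a seed. The helper name `make_seed` is an assumption made for illustration, and the result is still not suitable for anything security-related.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: builds a seed from the seconds and nanoseconds
   of the current time (requires C11 for timespec_get). */
unsigned int make_seed(void)
{
    struct timespec ts;

    if (timespec_get(&ts, TIME_UTC) != TIME_UTC)
        return (unsigned int)time(NULL);   /* fall back to whole seconds */

    return (unsigned int)ts.tv_sec ^ (unsigned int)ts.tv_nsec;
}
```

It would be used as `srand( make_seed() );`. The programs in this answer keep the simpler `srand( (unsigned)time(NULL) );` call.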

After doing everything described above and adding the necessary headers, your program should look like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( void )
{
    FILE* inp;
    inp = fopen("wordlist.txt","r");        //filename of your data file
    char arr[100][6];           //max word length 5

    srand( (unsigned)time(NULL) );

    int i = 0;
    while(1) {
        char r = (char)fgetc(inp);
        int k = 0;
        while(r!='\n' && !feof(inp)) {   //read till newline or EOF
            arr[i][k++] = r;            //store in array
            r = (char)fgetc(inp);
        }
        arr[i][k]=0;        //make last character of string null
        if(feof(inp)){      //check again for EOF
            break;
        }
        i++;
    }

    //print 7 random words
    for ( int j = 0; j < 7; j++ )
        printf( "%s\n", arr[rand()%i] );
}

For the input

meal
cheek
lady
debt
lab
math
basis
beer
bird
thing
mall
exam
user
news
poet
scene
truth
tea
way
tooth
cell
oven

this program gave me the following (random) output:

user
mall
poet
lab
cheek
lab
beer

As you can see, one of the random words is a duplicate.

As previously stated, you can shuffle the array using a Fisher-Yates shuffle if you want to prevent the same word from being chosen twice. After shuffling the array, you can simply select and print the first 7 elements of the array, if that is the number of words that you want to choose:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main( void )
{
    FILE* inp;
    inp = fopen("wordlist.txt","r");        //filename of your data file
    char arr[100][6];           //max word length 5

    srand( (unsigned)time(NULL) );

    int i = 0;
    while(1) {
        char r = (char)fgetc(inp);
        int k = 0;
        while(r!='\n' && !feof(inp)) {   //read till newline or EOF
            arr[i][k++] = r;            //store in array
            r = (char)fgetc(inp);
        }
        arr[i][k]=0;        //make last character of string null
        if(feof(inp)){      //check again for EOF
            break;
        }
        i++;
    }

    //perform a Fisher-Yates shuffle on the array
    for ( int j = 0; j < i - 1; j++ )
    {
        char temp[6];

        int k = rand() % ( i - j ) + j;

        if ( j != k )
        {
            //swap both array elements
            strcpy( temp, arr[j] );
            strcpy( arr[j], arr[k] );
            strcpy( arr[k], temp );
        }
    }

    //print first 7 elements of the shuffled array
    for ( int j = 0; j < 7; j++ )
    {
        //NOTE: This code assumes that i >= 7; otherwise it
        //prints uninitialized array elements and may crash.

        printf( "%s\n", arr[j] );
    }
}

Now, the same word can no longer be chosen twice:

meal
thing
news
user
mall
exam
tea

In my program above, I shuffled the entire array. However, if I only need the first 7 words to be randomized, then it would be sufficient to only perform 7 iterations of the outer loop when shuffling.
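
A minimal sketch of that shortened shuffle, written as a separate function so that the number of iterations is explicit. The names `partial_shuffle`, `count` and `picks` are assumptions made for illustration; `count` plays the role of `i` in the program above, and `WORD_SIZE` must match the second dimension of the array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define WORD_SIZE 6   /* matches char arr[100][6] in the program above */

/* Hypothetical helper: randomizes only the first picks positions of the
   first count words, using the same swap step as the full shuffle. */
void partial_shuffle(char arr[][WORD_SIZE], int count, int picks)
{
    assert(picks >= 0 && picks <= count);

    for (int j = 0; j < picks; j++) {
        char temp[WORD_SIZE];

        /* choose a random element from the not yet fixed part [j, count) */
        int k = rand() % (count - j) + j;

        if (j != k) {
            /* swap both array elements */
            strcpy(temp, arr[j]);
            strcpy(arr[j], arr[k]);
            strcpy(arr[k], temp);
        }
    }
}
```

After a call such as `partial_shuffle( arr, i, 7 );`, the first 7 elements of the array are the selected words, and no word can appear twice.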

Andreas Wenzel