
I want to check if there are any duplicates in a .txt file. I've written some code, but it's not working. I'm not sure about opening the norep.txt file in "a+" mode. The idea is to put the first word of my text into the norep.txt file, then compare every word in text.txt with the words in norep.txt and copy only the words I need into the file.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* for strcmp */

int main() {
    FILE *fd;
    FILE *ft;
    char aux[30];
    char aux1[30];    
    int len;    

    fd = fopen("c:\\text.txt", "r");
    if (fd == NULL) {
        puts("Error");
    }

    ft = fopen("c:\\norep.txt", "a+");
    if (ft == NULL) {
        puts("Error");
    }

    fscanf(fd, "%s", aux);
    fprintf(ft, "%s", aux);
    rewind(fd);
    rewind(ft);
    while (!feof(fd)) {
        fscanf(fd, "%s", aux);

        while (!feof(ft)) {
            fscanf(ft, "%s", aux1);
            len = strcmp(aux, aux1);

            if (len != 0) {
                fprintf(ft, "%s", aux);
            }
        }
        rewind(ft);
    }
    return 0;
}
chqrlie
Marco
    `it's not running.` Why not? Describe your problem further. – Litty Feb 03 '16 at 19:23
  • the program end immediately – Marco Feb 03 '16 at 19:24
  • The program end and the program not run are very different things. – Mad Physicist Feb 03 '16 at 19:29
  • 3
    Please read [Why is “while ( !feof (file) )” always wrong?](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong). – lurker Feb 03 '16 at 19:29
  • 1
    Also, exit if you encounter a fatal error, don't just print and keep going. – Mad Physicist Feb 03 '16 at 19:30
  • Unless the file is several gigabytes long, I would keep the list of words in memory, as a [binary search tree](https://en.wikipedia.org/wiki/Binary_search_tree), or as a [trie](https://en.wikipedia.org/wiki/Trie), or in a [hash table](https://en.wikipedia.org/wiki/Hash_table). But the simplest method is just to read the words into an unsorted array. Then sort the array, and scan the array for duplicates. Hint: the `realloc` function will be useful in creating the array. – user3386109 Feb 03 '16 at 19:31
  • `fscanf(fd, "%s", aux1)` is bad as 1) does not limit input length 2) the result of the function is not checked. – chux - Reinstate Monica Feb 03 '16 at 23:03

2 Answers


You should flush the output file before you rewind it.

See the manual entry for `fflush` (fflush — flush a stream).

Of course, this will not fix your problem because:

Note that the manual (quoted below) says repositioning operations are ignored for files opened in append mode, so your attempt to read back through the file will always find the end of file.

append: Open file for output at the end of a file. Output operations always write data at the end of the file, expanding it. Repositioning operations (fseek, fsetpos, rewind) are ignored. The file is created if it does not exist.

What you should probably do is build an in-memory table that holds all the unique entries and write it out to a new file after all processing is done. As you read the fd file, check the table and add a new entry if it is not already there. Only after you have finished processing fd should you write out the table. Of course, the table may be too big to hold in memory, depending on the size of your data file.

You could append each unique entry to the output file as you go, but then you would need some way of checking against the previous entries without trying to read the output file back.

sabbahillel
  • how do i flush the output? i suppose the program close while it's trying to write the new word in the norep.txt – Marco Feb 03 '16 at 19:47
  • 1
    @Marco I added a pointer to the fflush() entry in the two manuals. – sabbahillel Feb 03 '16 at 19:51
  • @Marco The manual seems to say that because it is marked as "append" then the data written to the disk must always be at the end of the file. This seems to be a requirement of Unix. – sabbahillel Feb 03 '16 at 20:02
  • 1
    @Marco I added an update to the answer but I do not have anything that I can say beyond this. – sabbahillel Feb 03 '16 at 20:08

The usual way to go about this is to read the input file word for word, store the necessary information in some way and then, after you have read all information from the file, write the desired output to the output file.

A rough skeleton of that approach might look like this:

#include <stdio.h>

int main(void)
{
    const char *infile = "text.txt";
    const char *outfile = "norep.txt";

    FILE *in;
    FILE *out;

    char word[30];
    char unique[1000][30];        // storage for the unique words
    int nunique = 0;
    int i;

    // (1) Read all words

    in = fopen(infile, "r");      // .. and enforce success

    while (fscanf(in, "%29s", word) == 1) {
        // store word somewhere
    }
    fclose(in);

    // (2) Determine unique words somehow

    // (3) Write out unique words

    out = fopen(outfile, "w");    // .. and enforce success

    for (i = 0; i < nunique; i++) {
        fprintf(out, "%s\n", unique[i]);
    }
    fclose(out);

    return 0;
}

The actual algorithm to find the unique words is missing from this incomplete skeleton.

If you really want to test the words in a file for uniqueness without using additional memory beyond the current word, you can open the input file twice, with independent file pointers. Then you can write a loop like so:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main()
{
    const char *infile = "text.txt";
    const char *outfile = "norep.txt";

    FILE *in1;
    FILE *in2;
    FILE *out;

    char word1[30];
    char word2[30];

    in1 = fopen(infile, "r");
    in2 = fopen(infile, "r");
    out = fopen(outfile, "w");

    if (in1 == NULL || in2 == NULL || out == NULL) {
        fprintf(stderr, "Could not open all required files.\n");
        exit(1);
    }

    while (fscanf(in1, "%29s", word1) == 1) {
        int count = 0;

        while (fscanf(in2, "%29s", word2) == 1) {
            if (strcmp(word1, word2) == 0) count++;
            if (count > 1) break;
        }

        if (count == 1) fprintf(out, "%s\n", word1);
        rewind(in2);
    }

    fclose(in1);
    fclose(in2);
    fclose(out);

    return 0;
}

This will, of course, re-read the file as many times as there are words in it — not a good approach for finding the unique words in Moby-Dick. I recommend that you look into the memory-based approach.

M Oehm
  • how can I modify the loop if I need to put the non-unique words in the file only once? – Marco Feb 03 '16 at 22:24
  • That's obvious, isn't it? A unique word occurs exactly once, a repeated word occurs more often. So you just need to change the condition for writing to `out` to `if (count > 1) ...`. – M Oehm Feb 04 '16 at 06:07
  • No, because I want to copy the non-unique words only once; this way the program copies the non-unique words as many times as they appear – Marco Feb 04 '16 at 14:39
  • Okay, you are right, the solution works only for unique words. You could probably work around that, but it would make a solution that is already quite ugly even uglier. Don't use the file approach; come up with a solution that stores the words in memory. In the comments, user3386109 has outlined some possibilities; his "simplest method" would work for small files, I think. – M Oehm Feb 04 '16 at 14:46