0

I'm writing a program that basically searches a directory and all its sub-directories for duplicate files. I have refined both the question and the code according to your suggestions (functions that needed to return a default value have been fixed to do so) so here it goes...

Here is the code of the comparing functions:

int compare()
{
    int a, b;
    unsigned char byte1, byte2;

    while(1)
    {
        a = fread(&byte1, 1, 1, file1);
        b = fread(&byte2, 1, 1, file2);
        if(a == 0 && b == 0) break;
        if(a != b) return 1;
        if(byte2 != byte1) return 1;
    }

    return 0;
}

void startCompare()
{
    char path1[1000], path2[1000];
    FILE *reference = fopen("list.comp", "r");
    FILE *other = fopen("list2.comp", "r");
    int i, flag, j;
    i = 0;

    while(fgets(path1, 1000, reference))
    {
        flag = 0;
        strtok(path1, "\n");  
        openFile1(path1);
        for(j = 0; j <= i; ++j)
        {
            fgets(path2, 1000, other);
        }
        while(fgets(path2, 1000, other))
        {
            strtok(path2, "\n");
            openFile2(path2);
            if(!compare())
            {
                printf("Checking: %s vs. %s --> DUPLICATE\n", path1, path2);
                flag = 1;
                break;
            }
            else
            {
                printf("Checking: %s vs. %s --> DIFFERENT\n", path1, path2);
            }
        }
        if(flag == 1)
        {
            printf("Will be deleted.\n");
        }
    }
}

(The startCompare() function is called first)

Now, the directory itself has these files:

  • bloblo
  • bloblo/frofo
  • bloblo/frofo/New Folder
  • bloblo/frofo/New Folder (2)
  • bloblo/frofo/New Folder (2)/New Folder (3)
  • bloblo/frofo/New Folder (2)/New Folder (3)/New Text Document.txt
  • bloblo/frofo/New Folder (2)/New Folder (3)/Untitled4
  • 0.comp
  • 1.comp
  • 2.comp
  • 3.comp
  • 4.comp
  • 5.comp
  • 11.comp
  • 100.comp
  • duplicate_delete.dev
  • duplicate_delete.exe
  • duplicate_delete.layout
  • list.comp
  • list2.comp
  • main.c
  • main.o
  • Makefile.win
  • Untitled5.c
  • Untitled5.exe

The output is:

Checking: 0.comp vs. 1.comp --> DIFFERENT
Checking: 0.comp vs. 100.comp --> DIFFERENT
Checking: 0.comp vs. 11.comp --> DIFFERENT
Checking: 0.comp vs. 2.comp --> DIFFERENT
Checking: 0.comp vs. 3.comp --> DIFFERENT
Checking: 0.comp vs. 4.comp --> DIFFERENT
Checking: 0.comp vs. 5.comp --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.dev --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.exe --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.layout --> DIFFERENT
Checking: 0.comp vs. list.comp --> DIFFERENT
Checking: 0.comp vs. list2.comp --> DIFFERENT
Checking: 0.comp vs. main.c --> DIFFERENT
Checking: 0.comp vs. main.o --> DIFFERENT
Checking: 0.comp vs. Makefile.win --> DIFFERENT
Checking: 0.comp vs. Untitled5.c --> DIFFERENT
Checking: 0.comp vs. Untitled5.exe --> DIFFERENT

With return code 0.

While what it should be printing is every file being checked with each other and finding that the files 100.comp and 11.comp are a duplicate of each other and the other files unique. So basically, why does it stop there? Why doesn't it continue to check? Is there any way this can be solved?

user253751
  • 57,427
  • 7
  • 48
  • 90
user3136933
  • 5
  • 1
  • 5
  • Please clarify which files you have (the formatting of your list got messed up), what the expected output is, and what output you get. Please try removing files that are not needed to demonstrate the problem (reading [this](http://stackoverflow.com/help/mcve) might also help). Also, please ask a question about *one* version of your code - either with or without `fclose` - this will reduce confusion. – anatolyg Jul 19 '15 at 21:28
  • 1
    `compare` uses `feof` wrong and lacks a return statement at the end. – melpomene Jul 19 '15 at 21:29
  • First of all make sure all your functions that return a value DO return a value. `main` is declared as returning `int`, but does not return nothing. `openFile1` returns 0 when you cannot open the file, but returns garbage from the stack when you can. The same for `openFile2`. What does return `compare` when all the ifs statements are not true? Garbage... – nsilent22 Jul 19 '15 at 21:35
  • 1
    Please note that `sizeof(unsigned char)` is by definition **1**. – Weather Vane Jul 19 '15 at 21:48
  • Your recent edit has confused the entire post. Recommend 1) Revert post to before "refined both the question and the code according to your suggestions" 2) Accept @Weather Vane 3) Post a _new_ question that refers to this one 4) Make certain your new post is completely compilable, Shows the remaining problem with minimum lines of input. – chux - Reinstate Monica Jul 20 '15 at 00:24

3 Answers3

1

I don't know whether this will answer your TLDR question and code, but this is too much for a comment.

Your function compare() never returns 0, if you had enabled and taken notice of compiler warnings, you would have known this. The function also uses the dreaded feof(). See why feof() is wrong

I suggest replacing this

int compare()
{
    while(!feof(file1))
    {
        fread(&byte1, sizeof(unsigned char), 1, file1);
        fread(&byte2, sizeof(unsigned char), 1, file2);
        if(byte2 != byte1) return 1;
    }
    if(feof(file1) && (!feof(file2))) return 1;
    if(feof(file2) && (!feof(file1))) return 1;
}

with this, because checking the amount of data read by fread() is the way to test for end of file.

int compare()
// return 0 if files are the same
// *** always include a comment to tell you what the function does / returns ***
{
    size_t read1, read2;
    while(1) {
        read1 = fread(&byte1, 1, 1, file1);
        read2 = fread(&byte2, 1, 1, file2);
        if (read1==0 && read2==0)
            break;             // success: both files ended
        if (read1 != read2)
            return 1;          // bad: one of them read, other didn't
        if (byte2 != byte1)
            return 1;          // bad: files read different data
    }
    return 0;
}

and note that sizeof(unsigned char) is quite unnecessary, it is 1.

I would also have put byte1 and byte2 as local variables.

Community
  • 1
  • 1
Weather Vane
  • 33,872
  • 7
  • 36
  • 56
0

To answer the question you actually asked (why doesn't it compare all pairs of files):

Your program first reads the first line from list.comp. Then, it reads each line from list2.comp and compares the files.

Then, it reads the next line from list.comp, and tries to compare it with all the files from list2.comp again. But it's already at the end of list2.comp, so it doesn't read any more filenames from list2.comp.

You could use rewind(other); to "rewind" back to the beginning of list2.comp, so you can read the filenames again.

user253751
  • 57,427
  • 7
  • 48
  • 90
0

Sorry, but if you are searching for duplicate files, you are using an inefficient approach. It's better to run md5sum(1) on them and sort the listing produced. Then, you'll compare one line with the next for equal md5 values. In case you don't trust md5sum(1) (somebody says that different files can give the same checksum), you can just compare the file contents (but only on the already matching md5 checksums). This is, by far, much more efficient than your approach. It can be solved with:

find <dir> -type f -name "glob_pattern" -print0 | xargs -0 md5sum | sort >files.md5sum

then, edit the file files.md5sum and search for /^\([0-9a-f]*\) .*\n\1/ pattern to get the repeated md5's. You'll get even multiple repetitions of the same file.

Note

Take care that empty files have all the same MD5 checksum and all of them compare equal. You'll get this also in your listing.

Community
  • 1
  • 1
Luis Colorado
  • 10,974
  • 1
  • 16
  • 31