Removing neighboring duplicate lines from a file using C

Question

Duplicate empty lines should also be removed. If a line contains characters such as a tab (\t), it is different from an empty line. The code below deletes too many lines, or sometimes leaves duplicates. How can I fix this?

#include <stdio.h>
#include <stdlib.h>

int main()
{
    char a[6000];
    char b[6000];
    int test = 0;
    fgets(a, 6000, stdin);
    while (fgets(b, 6000, stdin) != NULL) {
        for (int i = 0; i < 6000; i++) {
            if (a[i] != b[i]) {
                test = 1;
            }
        }
        if (test == 0) {
            fgets(b, 6000, stdin);
        } else {
            printf("%s", a);
        }
        int j = 0;
        while (j < 6000) {
            a[j] = b[j];
            j++;
        }
        test = 0;
    }
    return 0;
}

If the input of either of the lines is less than 5999 characters (including newline), then part of your arrays will be uninitialized and will have *indeterminate* contents. — Some programmer dude, Oct 21 '17 at 11:46
Lots of issues here. The string comparison tests all 6000 bytes of `a` and `b`, even if they actually contain strings that are much shorter. You aren't stripping the newline characters from the end of each input line, so if the file ends on two identical lines without a final newline character, then both lines will be left behind. Your program will also fail if any line is longer than 5999 characters. — r3mainer, Oct 21 '17 at 11:54
@squeamishossifrage a line without newline is different from a line with newline! A line without newline isn't even valid in POSIX (a line in text file is *terminated* by `\n`) — Antti Haapala -- Слава Україні, Oct 21 '17 at 12:01
Also, if this is a POSIX system I'd just use `LINE_MAX` from [`<limits.h>`](http://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html) instead of the magic number `6000`. — Antti Haapala -- Слава Україні, Oct 21 '17 at 12:02
@AnttiHaapala Well, OK. But that's not how [`uniq`](https://en.wikipedia.org/wiki/Uniq) behaves. — r3mainer, Oct 21 '17 at 13:22
`fgets()` guarantees that it will zero terminate the string it reads, so you can use `strcmp()` or `strncmp()` or (if you wish) do comparisons by hand that check for the terminator - that will avoid undefined behaviour that results from checking all characters in the buffer. You also need to allow for the possibility of a line in the file longer than the buffer length - which `fgets()` does handle, but you need to check for it. — Peter, Oct 21 '17 at 13:27

score 3 · Answer 1 · 2017-10-21T13:33:18.990

Your logic is mostly sound. You are on the right track with your train of thought:

1. Read a line into previous (a).
2. Read another line into current (b).
3. If previous and current have the same contents, go to step 2.
4. Print previous.
5. Move current to previous.
6. Go to step 2.

This still has some problems, however.

Unnecessary line-read

To start, consider this bit of code:

while(fgets(b,6000,stdin)!=NULL) {
    ...
    if(test==0) {
        fgets(b,6000,stdin);
    }
    else {
        printf("%s",a);
    }
    ...
}

If a and b have the same contents (test==0), you call fgets a second time without checking its result, and then the loop condition fgets(b,6000,stdin)!=NULL reads yet another line. The line read by the inner call is copied from b to a without ever being compared with the line before it, and the duplicated line itself is never printed, so a whole run of duplicates can disappear. Since the loop condition already reads the next line and checks for failure, just let the loop read the line: drop the inner fgets and invert the if statement's test so that a is printed when test!=0.

Where's the last line?

Your logic also will not print the last line. Consider a file with 1 line. You read it, then fgets in the loop condition attempts to read another line, which fails because you're at the end of the file. There is no print statement outside the loop, so you never print the line.

Now what about a file with 2 lines that differ? You read the first line, then the last line, see they're different, and print the first line. Then you overwrite the first line's buffer with the last line. You fail to read another line because there aren't any more, and the last line is, again, not printed.

You can fix this by replacing the first (unchecked) fgets with a[0] = 0. That makes the first byte of a a null byte, which means the end of the string. It won't compare equal to a line you read, so test==1, meaning a will be printed. Since there is no string in a to print, nothing is printed. Things then continue as normal, with the contents of b being moved into a and another line being read.

Unique last line problem

This leaves one problem: the last line won't be printed if it's not a duplicate. To fix this, just print b instead of a.

The final recipe

1. Assign 0 to the first byte of previous (a[0]).
2. Read a line into current (b).
3. If previous and current have the same contents, go to step 2.
4. Print current.
5. Move current to previous.
6. Go to step 2.

As you can see, it's not much different from your existing logic; only steps 1 and 4 differ. It also ensures that all fgets calls are checked. If there are no lines in a file, nothing is printed. If there is only 1 line in a file, it is printed. If 2 lines differ, both are printed. If 2 lines are the same, the first is printed.
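The recipe above can be sketched as a small function. This is only a sketch under a few assumptions: the name dedup_adjacent and the LINE_SIZE macro are made up for illustration, strcmp and strcpy stand in for the hand-written loops from the question, and lines longer than the buffer are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 6000  /* same buffer size as the question; an assumption, not a requirement */

/* Hypothetical helper: copies `in` to `out`, dropping every line that is
   identical to the line just before it. Returns the number of lines written. */
static unsigned dedup_adjacent(FILE *in, FILE *out)
{
    char prev[LINE_SIZE];
    char cur[LINE_SIZE];
    unsigned written = 0;

    assert(in != NULL && out != NULL);

    prev[0] = '\0';                        /* step 1: previous starts out empty */
    while (fgets(cur, sizeof cur, in)) {   /* step 2: read a line into current */
        if (strcmp(prev, cur) == 0)        /* step 3: same contents, read the next line */
            continue;
        fputs(cur, out);                   /* step 4: print current */
        strcpy(prev, cur);                 /* step 5: move current to previous */
        written++;
    }                                      /* step 6: go to step 2 */
    return written;
}
```

Calling dedup_adjacent(stdin, stdout) from main would give the filter behaviour asked for in the question.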

Optional: optimizations

Instead of checking all 6000 bytes, you only need to check up to the first null byte in either string, since fgets always adds one to mark the end of the string.
Faster still would be to add a break statement inside the if statement of your for loop. If a single byte doesn't match, the entire line is not a duplicate, so you can stop comparing early. That is a lot faster if only byte 10 differs in two 1000-byte lines!
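Both optimizations can be combined in one comparison function. The sketch below is illustrative only: the name lines_equal is hypothetical, and in practice strcmp(a, b) == 0 gives the same answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the two null-terminated strings are
   identical, 0 otherwise. It never reads past the first null byte. */
static int lines_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; ; i++) {
        if (a[i] != b[i])
            return 0;   /* first mismatch: stop comparing immediately */
        if (a[i] == '\0')
            return 1;   /* both strings ended at the same position */
    }
}
```

Because a mismatch is checked before the terminator, a string that is a prefix of the other (for example "ab" and "abc") is correctly reported as different.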

score 2 · Answer 2 · answered Oct 21 '17 at 13:56

#include <stdio.h>
#include <string.h>

int main(void)
{
char buff[2][6000];
unsigned count=0;
char *prev=NULL
        , *this= buff[count%2]
        ;
while( fgets(this, sizeof buff[0] , stdin)) {
        if(!prev || strcmp(prev, this) ) { // first or different
                fputs(this, stdout);
                prev=this;
                count++;
                this=buff[count%2];
                }
        }
fprintf(stderr, "Number of lines written: %u\n", count);
return 0;
}

H.S. · Answer 3 · 2017-10-21T19:55:13.783

There are a few problems in your code, such as:

    for(int i=0; i<6000; i++) {
        if(a[i]!=b[i]) {
            test=1;
        }
    }

This loop always compares the whole buffer character by character, even after it has found a[i]!=b[i] for some value of i. You should break out of the loop after setting test=1.

Your logic also fails for a file with just 1 line, because you never print a line outside the loop.

Another problem is the fixed-length buffer of 6000 chars.

Maybe you can use getline to solve your problem. You can do:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
        char * line = NULL;
        char * comparewith = NULL;
        int notduplicate;
        size_t len = 0;
        ssize_t read;

        while ((read = getline(&line, &len, stdin)) != -1) {
                notduplicate = (comparewith == NULL) || (strcmp(line, comparewith) != 0);
                if (notduplicate) {
                        /* getline() keeps the trailing newline, so don't print another one */
                        printf ("%s", line);
                        if (comparewith != NULL)
                                free(comparewith);
                        comparewith = line;
                        line = NULL;
                }
        }

        if (line)
                free (line);

        if (comparewith)
                free (comparewith);

        return 0;
}

An important point to note:

getline() is not in the C standard library. It was originally a GNU extension and was standardized in POSIX.1-2008, so this code may not be portable. To make it portable, you'll need to roll your own getline().
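One possible shape for such a replacement, using only standard C, is sketched below. It is not a drop-in getline(): the name read_line, its signature, and the starting capacity of 128 are assumptions made for this example, and it does not cope with null bytes embedded in the input.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line of any length from `in`, keeping the
   trailing newline if there is one. Returns a malloc'd string that the caller
   must free, or NULL at end of input or if allocation fails. */
static char *read_line(FILE *in)
{
    size_t cap = 128, len = 0;
    char *buf;

    assert(in != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    /* Keep appending chunks until a newline shows up or input ends. */
    while (fgets(buf + len, (int)(cap - len), in)) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                    /* complete line */

        /* No newline yet: double the buffer and read the next chunk. */
        char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    if (len > 0)
        return buf;                        /* last line without a trailing newline */
    free(buf);
    return NULL;                           /* nothing was read */
}
```

In the program above, the loop would then become while ((line = read_line(stdin)) != NULL), with the same strcmp and free logic as before.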

score 0 · Answer 4 · answered Oct 22 '17 at 13:18

Here is a much simpler solution that has no limitation on line length:

#include <stdio.h>

int main(void) {
    int c, last1 = 0, last2 = 0;
    while ((c = getchar()) != EOF) {
        if (c != '\n' || last1 != '\n' || last2 != '\n')
            putchar(c);
        last2 = last1;
        last1 = c;
    }
    return 0;
}

The code skips the third and every later newline in a run of consecutive newline characters, hence it collapses repeated blank lines into a single one. Note that it does not remove duplicate non-empty lines.