Stuck in writing my diff utility in C++

Question

I was trying to use what i have learned about file and resource handling in C++: I would like to write a diff-like utility.

Here It is my latest version

#include <iostream>
#include <cstdlib>
#include <fstream>

int main(int argc, char* argv[])
{
  if(argc!=3)
  {
    std::cout << "error: 2 arguments required, now exiting ..." << std::endl;
    exit (EXIT_FAILURE);
  }

  std::ifstream file_1(argv[1]);
  std::ifstream file_2(argv[2]);

  if( file_1.fail() || file_2.fail() )
  {
    std::cout << "error: can't open files, now exiting ..." << std::endl;
    exit (EXIT_FAILURE);
  }

  std::string dummy_1;
  std::string dummy_2;

  while(!file_1.eof()) // dummy condition
  {
    std::getline(file_1,dummy_1);
    std::getline(file_2,dummy_2);
    std::cout << ((dummy_1==dummy_2) ? "= " : "# ") << dummy_1 << std::endl << "  " << dummy_2 << std::endl;
  }

  return(0);
}

This are my guidelines:

compare 2 files
the user must pass the names of this 2 files directly to the executable, only this 2 arguments
to cover as much error handling as possible in C++
try to avoid platform specific steps or non-portable code

My actual problem is that i don't know how to improve my dummy condition effectively. For now the while iteration just follows the length of the first passed file and I would like to obiviously go all the way down in both files and solve this without introducing an overkill like an extra cicle to get and compare the length of this 2 files before doing the real comparison.

I also would like to know if my approach can be considered safe.

Eventually I could also accept answers proposing a solution with the boost libraries since they are quite portable and I already know that i will use them for other reasons.

Thanks.

What you've written is a great start. You may want to provide more detailed error messages, for example by checking `fail()` separately for each file and telling the user which one couldn't be opened, or by telling the user which file ended first. — Adam Liss, Nov 11 '12 at 15:02
@AdamLiss I have also noticed that the `exit()` function is part of the C standard library for C++, aka `cstdlib`, it's this a real C++ way of handling the exit from my program ? My main problem here is to be sure that the application will work and it's portable, I'm lazy and I'm delaying a bunch of extra checks and `std::cout` intentionally :! — user1802174, Nov 11 '12 at 15:13
@user1802174: `exit()` is fine if you want to explicitly terminate the program from any point, and also gives you the option of exiting with a numeric code, which could be read from batch files under Windows. — Jonathan Wood, Nov 11 '12 at 15:22
The diff of two text files is a non trivial problem. http://en.wikipedia.org/wiki/Diff — Martin York, Nov 11 '12 at 15:23
@LokiAstari that's why i would like to start with an in depth analysis about this, i will have to handle a lot of files and I was also curious about why this looks trivial but it's not. — user1802174, Nov 11 '12 at 15:38

score 3 · Answer 1 · edited May 23 '17 at 10:24

3

As usual eof() is the wrong thing to do. This works

while (std::getline(file_1, dummy_1) && std::getline(file_2, dummy_2))
{
    ...
}

Suggest you read up on what eof() really does. it doesn't to what you think, but actually it will be useful in this program because you can use it the proper way, to tell which of your two files has hit the end of file. See here

You can actally use eof() properly in this program to find out which of the two files hit end of file. I would probably write your loop something like this

for (;;)
{
    getline(file_1, dummy_1);
    getline(file_2, dummy_2);
    if (file_1.eof() || file_2.eof())
        break;
    ...
}
if (file_1.eof() && file_2.eof())
{
    // both at end of file
}
else if (file_1.eof())
{
    // file 1 at end of file
}
else
{
    // file 2 at end of file
}

Notice though that the eof() test comes after the getline(), not before. That's how eof() is supposed to be used.

edited May 23 '17 at 10:24

Community

1
1

answered Nov 11 '12 at 15:00

john

85,011
4
57
81

@AdamLiss One day one of the newbie posters using eof() wrongly will explain to me why they do it. I've tried asking nicely, with detailed explanations, it doesn't work. It is incredible how so many people make the *same* basic mistake, there must be a reason for it, I just want to know what it is. – john Nov 11 '12 at 15:07
how getline iterates ? it has its own implicit iterator ? – user1802174 Nov 11 '12 at 15:10
in my case it's probably just how it sound "eof = end of file" :! – user1802174 Nov 11 '12 at 15:10
In this case I disagree. This is one of the rare situations where it may be better to use eof() in the test (because we need to loop over two files and continue even if one finishes first). – Martin York Nov 11 '12 at 15:12
eof() tests the current state of the stream. It tells you if the *last* thing you did on the stream failed because of an end of file. It does not tell you if the *next* you do on the stream will fail because of end of file. That is how you (and almost every other neebie) were trying to use it. – john Nov 11 '12 at 15:12
@LokiAstari I would say the eof() test should come after this loop. You can use it to find out which of the two files has hit end of file. This program has a legitimate use for eof(), just not in the way the original code was. – john Nov 11 '12 at 15:14
Now i can see why eof is a no-go most of times, but your code forks a lot, the concept is clear, the implementation ... the getline solution looks more "handy" and clean. – user1802174 Nov 11 '12 at 15:41

score 3 · Answer 2 · answered Nov 11 '12 at 15:15

As john pointed out. Using eof() in the condition is usually wrong.

But in this case I think it is appropriate. But as a result you need to add some extra checks.

while(true)  // exit provided by break.
{
    std::string dummy_1;   // By declaring them here you force them to be 
    std::string dummy_2;   // reset each iteration.

    // Because you are doing the read inside the loop
    // You need to check if the reads work.
    if (!std::getline(file_1,dummy_1) && !std::getline(file_2,dummy_2))
    {
        // Only exit if both reads fail.
        break;
    }

    // Got here if at least one read worked.
    // A failed read will result in an empty line for comparison.    
    std::cout << ((dummy_1==dummy_2) ? "= " : "# ") << dummy_1 << std::endl << "  " << dummy_2 << std::endl;
}

Jerry Coffin · Accepted Answer · 2012-11-12T18:58:32.527

I started by writing rather a long comment on @Loki Astari's answer, but it's long enough (and, IMO, enough cleaner way to do the job) that it probably makes the most sense as an independent answer. In this case, you want something close to the standard loop, except that you keep reading as long as a read from one of the files succeeds. That being the case, @john is right, and it's best to avoid using eof() as part of the loop condition.

std::string line1, line2;
static const char *prefixes[] = {"#  ", "=  "};

~~while (std::getline(file_1, line1) || std::getline(file_2, line2)) std::cout << prefixes[line1==line2] << line1 << "\n " << line2 << "\n";~~

Edit: @user1802174 raised a good point -- as it was, the loop didn't actually read data in parallel at all. Since it was using || which does short-circuit evaluation, when/if the read from the first file succeeded, it didn't read anything from the second file. Fortunately, he was wrong about one thing: it is fairly easy to fix. At least in this case, + works fine, although we do have to explicitly cast the result to bool. I've also added a fix for the fact that upon failure, getline leaves the previous content of the string intact, so we need to explicitly clear the strings every iteration of the loop to get the desired behavior.

while (line1.clear(), line2.clear(), 
      (bool)std::getline(file_1, line1) + (bool)std::getline(file_2, line2))
{
    std::cout << prefixes[line1==line2] << line1 << "\n   " << line2 << "\n";
}

This time I did a quick test. File 1:

line1
line 2

File 2:

line 1
line 2
line 3

result:

#  line1
   line 1
=  line 2
   line 2
#
   line 3

While obviously still not a full-blown diff utility, I think this is doing what was intended.

As in @Loki Astari's answer, this will basically act as if the file with fewer lines was padded with as many empty lines at the end as necessary to match the longer file.

As an aside, also note the use of "\n" instead of std::endl. Along with inserting a new-line, std::endl also flushes the output buffer, which you almost certainly don't want in this case. Flushing the buffer still produces the correct results, but in many cases is likely to do so much more slowly.

Edit: As far as coding style goes, it probably is a bit better to write the loop as a for loop instead of a while:

for ( ; (bool)std::getline(file_1, line1) + (bool)std::getline(file_2, line2))
      ; line1.clear(), line2.clear())
{
    std::cout << prefixes[line1==line2] << line1 << "\n   " << line2 << "\n";
}

I personally see little real gain from using C++ style casts here. If I wanted to get away from using (bool), I'd probably use another well-known idiom (which, admittedly, many people also dislike):

for ( ; !!std::getline(file_1, line1) + !!std::getline(file_2, line2))
      ; line1.clear(), line2.clear())
{
    std::cout << prefixes[line1==line2] << line1 << "\n   " << line2 << "\n";
}

If somebody really objects to using a comma operator, this is easy to rewrite as:

while (!!std::getline(file_1, line1) + !!std::getline(file_2, line2))       
{
    std::cout << prefixes[line1==line2] << line1 << "\n   " << line2 << "\n";
    line1.clear();
    line2.clear();
}

Personally, I don't consider that an improvement, but others may disagree.

your solution generates this problem http://stackoverflow.com/questions/13343387/or-operator-doesnt-evaluate-both-statements-only-1 and looks like it's not "fixable" — user1802174, Nov 12 '12 at 12:10
@user1802174: You're half right anyway -- it was wrong, but it is fixable (see revised answer). — Jerry Coffin, Nov 12 '12 at 14:37
Your code works. But from both an aesthetic nature and code maintenance perspective I think you have pushed too much work into the condition (thus making it less readable). Its a code smell. A possible solution is to refactor the test into a separate function. Please no C-casts and lets not find the comma operator useful (its just a cramming tool). Currently if you had to run a debugger (in step mode) over that you would have a hard time validating it worked. — Martin York, Nov 12 '12 at 17:43
I'll admit I don't particularly care for the code exactly as-is, but don't find comma operators particularly objectionable. I don't like casting in general, but don't honestly consider C++ style casts any major improvement over C-style. — Jerry Coffin, Nov 12 '12 at 18:51

Stuck in writing my diff utility in C++

3 Answers3