
Okay, so I know that disk writing errors are very rare, so please just look past that, because the data I am working with is incredibly important (SSIDs kind of important). I want to copy a file in the absolute most robust way while using the absolute minimum amount of memory to do so. This is as far as I have gotten. It sucks up a lot of memory, but I can't find the source. The way it works is by re-checking each byte tons of times until it gets a confirmed result (this may increase the number of false positives for errors by a lot, but it should also reduce the chance of an actual undetected error by a big margin). Also, the sleep at the bottom is there so you have time to analyze the program's overall performance using the Windows Task Manager.


#include <cstdio>   // fopen, fclose, fgetc, fputc, fseek, ftell, fsetpos, setvbuf
#include <cstdlib>
#include <unistd.h>
#include <iostream>

using namespace std;

__inline__ bool copy_file(const char* From, const char* To)
{
    FILE infile = (*fopen(From, "rb"));
    FILE outfile = (*fopen(To, "rwb+"));
    setvbuf( &infile, nullptr, _IONBF, 0);
    setvbuf( &outfile, nullptr, _IONBF, 0);

    fseek(&infile,0,SEEK_END);
    long int size = ftell(&infile);
    fseek(&infile,0,SEEK_SET);
    unsigned short error_amount;
    bool success;
    char c;
    char w;
    char l;

    for ( fpos_t i=0; (i != size); ++i ) {
        error_amount=0;
        fsetpos( &infile, &i );
        c = fgetc(&infile);
        fsetpos( &infile, &i );
        success=true;
        for ( l=0; (l != 126); ++l ) {
            fsetpos( &infile, &i );
            success = ( success == ( fgetc(&infile)==c ) );
        }
        while (success==false) {
            fsetpos( &infile, &i );
            if (error_amount==32767) {
                cerr << "There were 32768 failed attemps at accessing a part of the file! exiting the program...";
                return false;
            }
            ++error_amount;
            //cout << "an error has occured at position ";
            //printf("%d in the file.\n", (int)i);
            c = fgetc(&infile);
            fsetpos( &infile, &i );
            success=true;
            for ( l=0; (l != 126); ++l ) {
                fsetpos( &infile, &i );
                success = ( success == ( fgetc(&infile)==c ) );
            }
        }



        fsetpos( &infile, &i );
        fputc( c, &outfile);
        fsetpos( &outfile, &i );


        error_amount=0;
        w = fgetc(&infile);
        fsetpos( &outfile, &i );
        success=true;
        for ( l=0; (l != 126); ++l ) {
            fsetpos( &outfile, &i );
            success = ( success == ( fgetc(&outfile)==w ) );
        }
        while (success==false) {
            fsetpos( &outfile, &i );
            fputc( c, &outfile);
            if (error_amount==32767) {
                cerr << "There were 32768 failed attemps at writing to a part of the file! exiting the program...";
                return false;
            }
            ++error_amount;
            w = fgetc(&infile);
            fsetpos( &infile, &i );
            success=true;
            for ( l=0; (l != 126); ++l ) {
                fsetpos( &outfile, &i );
                success = ( success == ( fgetc(&outfile)==w ) );
            }
        }
        fsetpos( &infile, &i );
    }

    fclose(&infile);
    fclose(&outfile);

    return true;
}

int main( void )
{
    int CopyResult = copy_file("C:\\Users\\Admin\\Desktop\\example file.txt","C:\\Users\\Admin\\Desktop\\example copy.txt");

    std::cout << "Could it copy the file? " << CopyResult << '\n';

    sleep(65535);
    return 1;
}


So, if my code is on the right track toward the best way, what can be done to improve it? But if my code is totally off from the best solution, then what is the best solution? Please note that this question is essentially about detecting rare disk writing errors for the purpose of copying very very very very (etc.) important data.

Jack G
  • Since you have code that already works, you should move your question to [Code Review](http://codereview.stackexchange.com/). – dandan78 Jan 15 '16 at 13:29
  • An unrelated note about your use of macros: Don't "rename" common standard structures. Nobody knows what `file` is, everybody knows what `FILE` is. – Some programmer dude Jan 15 '16 at 13:30
  • Also, you don't know how e.g. `fopen` creates the structure pointer it returns. By dereferencing it and copying it into your own structure you have possible *undefined behavior* when you call `fclose`. What if `fopen` calls `malloc` to create the memory, and `fclose` calls `free`? – Some programmer dude Jan 15 '16 at 13:31
  • Apart from the C++-style includes and the output, this is plain old C. What makes you tag this as a C++ question? – DevSolar Jan 15 '16 at 13:31
  • Why are you storing FILE objects rather than just using FILE*? – Martin Bonner supports Monica Jan 15 '16 at 13:31
  • I am tagging this as C++ because the answer can be in C++. – Jack G Jan 15 '16 at 13:33
  • @dandan78: but won't Code Review generate a lot of stuff along the lines of "don't use `32767` as a magic number in the source", which while true is completely irrelevant to the question? Or is it OK to ask a question on Code Review which is basically, "I don't want my code reviewed, but is there a better technique than this for copying files in C++?"? – Steve Jessop Jan 15 '16 at 13:35
  • Related: http://stackoverflow.com/q/10195343/5069029 – 301_Moved_Permanently Jan 15 '16 at 13:36
  • As for making it "robust", the common way is to make a *checksum* of the source file, do a very simple copy operation (just read and write a buffer), and then compare the source checksum with a checksum of the destination. – Some programmer dude Jan 15 '16 at 13:36
  • That question is not related, because this question is about avoiding the rare disk writing error. – Jack G Jan 15 '16 at 13:38
  • Checksums are not completely reliable because of the pigeonhole principle, so using them would defeat the purpose. – Jack G Jan 15 '16 at 13:39
  • I beg to disagree. For all practical purposes, checksums are very reliable. Notice that git is based entirely on the SHA-1 algorithm. If you have two files with the same SHA-1 checksum, git will think they are identical. The chances of this happening are smaller than the chance of an airplane landing on your house in the next 5 minutes. So why worry about it? – Adi Levin Jan 15 '16 at 13:43
  • Obviously you should read up on the pigeonhole principle. Sure, checksums would detect some errors, but it would still be statistically impossible for checksums to detect most errors. – Jack G Jan 15 '16 at 13:51
  • Please read up on the pigeonhole principle and another one of my posts on this subject: http://math.stackexchange.com/questions/1539120/file-compression-statistics – Jack G Jan 15 '16 at 14:00
  • The idea that you can get a "reliable" copy without a proper Fault Mode Analysis is laughable. Worrying about collisions in a checksum, but ignoring the possibility of a write failure being masked by caching? The former is a statistical impossibility, the latter happens every few terabytes. – MSalters Jan 15 '16 at 14:16

2 Answers

2

I would just copy the file without any special checks, and at the end I would read the copy back and compare its hash value to that of the source. For a hash function, I would use MD5 or SHA-1.
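
For illustration, here is a minimal sketch of that copy-then-verify idea. It assumes OpenSSL's SHA-1 routines (<openssl/sha.h>, link with -lcrypto) and uses placeholder file names; any other hashing library, or MD5, would work the same way. Note that the destination stream is closed before it is re-read, so the verification sees what actually reached the file rather than an in-process buffer.

// Illustrative sketch only: plain buffered copy followed by hash verification.
#include <openssl/sha.h>

#include <cstring>
#include <fstream>
#include <iostream>

// Plain buffered copy: read a block, write a block -- no per-byte re-reading.
static bool plain_copy(const char* from, const char* to)
{
    std::ifstream in(from, std::ios::binary);
    std::ofstream out(to, std::ios::binary | std::ios::trunc);
    if (!in || !out) return false;

    char buf[1 << 16];
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        out.write(buf, in.gcount());
        if (!out) return false;
    }
    out.flush();
    return static_cast<bool>(out);    // 'out' is closed when it goes out of scope
}

// Hash a whole file with SHA-1; returns false if the file cannot be read.
static bool sha1_of_file(const char* path, unsigned char digest[SHA_DIGEST_LENGTH])
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    SHA_CTX ctx;
    SHA1_Init(&ctx);

    char buf[1 << 16];
    while (in.read(buf, sizeof buf) || in.gcount() > 0)
        SHA1_Update(&ctx, buf, static_cast<size_t>(in.gcount()));

    SHA1_Final(digest, &ctx);
    return true;
}

int main()
{
    const char* src = "example file.txt";    // placeholder paths
    const char* dst = "example copy.txt";

    if (!plain_copy(src, dst)) {
        std::cerr << "Copy failed\n";
        return 1;
    }

    // Verify: re-read both files (the destination is already closed) and compare hashes.
    unsigned char a[SHA_DIGEST_LENGTH], b[SHA_DIGEST_LENGTH];
    if (!sha1_of_file(src, a) || !sha1_of_file(dst, b) ||
        std::memcmp(a, b, SHA_DIGEST_LENGTH) != 0) {
        std::cerr << "Verification failed\n";
        return 1;
    }
    std::cout << "Copy verified\n";
    return 0;
}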

Adi Levin
  • Checksums are not completely reliable because of the pigeonhole principle, so using them would defeat the purpose. – Jack G Jan 15 '16 at 13:41
  • Also, this question is more about avoiding the rare disk writing error. – Jack G Jan 15 '16 at 13:42
  • There is no 100% perfect solution. Every algorithm has a small chance of reporting that the copy has succeeded when in fact it failed. Even if you complete the write successfully, there is no perfect guarantee that a byte will not get corrupted a minute later. Therefore, I prefer a simple and well-understood solution based on standards over complex code that is hard to understand and that might or might not be more robust. – Adi Levin Jan 15 '16 at 13:46
  • BINGO!!!!! You are absolutely correct!!! Essentially, what I am trying to do is greatly reduce the chance of incorrectly copying, and for that, checksums are NOT the way to go. – Jack G Jan 15 '16 at 13:54
  • Please read up on the pigeonhole principle and another one of my posts on this subject: http://math.stackexchange.com/questions/1539120/file-compression-statistics – Jack G Jan 15 '16 at 13:57
  • After reading it, I think you'll agree on why checksums can't be too reliable. – Jack G Jan 15 '16 at 13:58
  • Thanks for the reference :-). I agree with you that it's not perfect, but checksums give you a mechanism for increasing the reliability as close as you want to 100%. If you think SHA-1 is not strong enough for your purposes, use a stronger algorithm, or even use 10 different checksums. I doubt there is any algorithm either of us can invent that is stronger than that. – Adi Levin Jan 15 '16 at 14:01
1
#include <boost/filesystem.hpp>
#include <iostream>

int main()
{
    try
    {
        boost::filesystem::copy_file( "C:\\Users\\Admin\\Desktop\\example file.txt",
                                      "C:\\Users\\Admin\\Desktop\\example copy.txt" );
    }
    catch ( boost::filesystem::filesystem_error const & ex )
    {
        std::cerr << "Copy failed: " << ex.what();
    }
}

This will call arguably the most robust implementation available -- the one provided by the operating system -- and report any failure.
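
When a failure is reported, the caller is free to simply try again. As an illustration only, here is a small hypothetical wrapper around the same call with an arbitrary retry count (it uses the older copy_option enum; newer Boost versions spell it copy_options):

// Illustrative only: retry the OS-backed copy a few times before giving up.
#include <boost/filesystem.hpp>
#include <iostream>

bool copy_with_retries(const boost::filesystem::path& from,
                       const boost::filesystem::path& to,
                       int attempts = 3)                   // arbitrary retry count
{
    for (int i = 0; i < attempts; ++i)
    {
        try
        {
            // Overwrite any partial copy left behind by a failed attempt.
            boost::filesystem::copy_file(from, to,
                boost::filesystem::copy_option::overwrite_if_exists);
            return true;    // the operating system reported success
        }
        catch (boost::filesystem::filesystem_error const & ex)
        {
            std::cerr << "Attempt " << (i + 1) << " failed: " << ex.what() << '\n';
        }
    }
    return false;
}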


My point being:

The chances of your saved data ending up corrupted are astronomically small to begin with.

Any application where this might actually be an issue should be running on redundant storage (RAID arrays), on filesystems that checksum their data (like Btrfs or ZFS), etc., again reducing the chance of failure significantly.

Doing complex things in home-grown I/O functions, on the other hand, increases the probability of mistakes and/or false negatives immensely.

DevSolar
  • Very good, but what about detecting the rare disk writing errors? – Jack G Jan 15 '16 at 13:44
  • @lolzerywowzery: Disk errors are detected by the drive firmware, reported to the OS, reported to my program, reported to you as "Copy failed". You may try again. – DevSolar Jan 15 '16 at 13:51
  • @lolzerywowzery: It's **possible** that a natural disaster will hit your house and wipe out your hard drive. – DevSolar Jan 15 '16 at 14:03
  • I agree. If you're looking for really high durability, you have to replicate and distribute your data. Normal file systems are not designed for really high durability. But even with replication, there is never a perfect guarantee. – Adi Levin Jan 15 '16 at 14:04
  • If checksums work so well, then what about reading each byte of the copied file and the original to ensure that they're the same? – Jack G Jan 15 '16 at 14:06
  • @lolzerywowzery: Checksums are what modern file systems (like ZFS, Btrfs) do, yes. *Don't do it the heavy-handed, dog-slow, and error-prone way displayed by your example code.* And **please** spare me that "pigeonhole" argument. The chances of a checksum collision are smaller than you *and* me winning the lottery on the same day. – DevSolar Jan 15 '16 at 14:13
  • It's like trying to find the smallest number. There is no end to it. You should not do everything possible to reduce the chance of a disk writing error when disk write errors are much less likely than other aspects of system reliability (I'm sure your system does more than just copy a single file, right?). At some point, you will have to acknowledge that the reliability is not perfect but is high enough, and then focus your attention on other things instead. – Adi Levin Jan 15 '16 at 14:19