Copying a huge file using read(), write() and open() API's in linux

Question

I have been learning system programming in Linux and I'm trying to copy a video using read() and write(). The problem I'm facing is that I cannot save the entire file into a buffer, since it's a huge file.

I thought I could loop it in since I was using write with append flag but then how would I use it with read?

This is my messed up code. I would appreciate any help:

int main() {

    int movie_rdfd = open("Suits.mp4", O_RDONLY); //fd for read
    off_t file_length = (int)(fseek(movie_rdfd, 0, SEEK_END));

    printf("This is fd for open: %d\n", movie_rdfd); //line to be deleted
    char* Save[fseek(movie_rdfd, 0, SEEK_END)];

    int C1 = read(movie_rdfd, Save, );

    printf("Result of Read (C1): %d\n", C1); //line to be deleted

    int movie_wrfd = open("Suits_Copy.mp4", O_WRONLY|O_CREAT, 0644); //fd for write

    printf("This is result of open: %d\n", movie_wrfd); //line to be deleted

    int C2 = write(movie_wrfd, Save, fseek(movie_rdfd, 0, SEEK_END));

    printf("Result of Read (C2): %d\n", C2); //line to be deleted

    close(movie_rdfd);
    close(movie_wrfd);

    return 0;
}

Also, it's showing a segmentation fault when I try to find the file size.

I'd start by adding `if (movie_rdfd == -1) { perror("Error opening file"); return 1; }` just after the open call to check that the file was opened correctly. — LPs, Feb 10 '17 at 09:45
Creating an array whose size is not a compile-time constant is non-standard as far as I know. Also I'm sure you don't want an array of pointers to chars. You want to read and write blocks and that's it. — Sami Kuhmonen, Feb 10 '17 at 09:46
@SamiKuhmonen It is [VLAs](https://gcc.gnu.org/onlinedocs/gcc/Variable-Length.html) — LPs, Feb 10 '17 at 09:49
@LPs Thanks, didn't know they were accepted in C, been away too much. — Sami Kuhmonen, Feb 10 '17 at 09:51
Using `mmap` with `PROT_READ` on `Suits.mp4` will ease your task significantly. — Blagovest Buyukliev, Feb 10 '17 at 09:56
The buffer should be `char`, not `char*`. Allocate a sufficient size, then just `read` and note the number returned. Then `write` the same amount of bytes. Repeat until end of file. — Bo Persson, Feb 10 '17 at 10:12
I'm not sure why `mmap()` is so often put forth as a solution for uses such as this. Read [this from one Linus Torvalds regarding `mmap()` - you may have heard of him.](http://marc.info/?l=linux-kernel&m=95496636207616&w=2) — Andrew Henle, Feb 10 '17 at 10:13
Possible duplicate of [Copy a file in a sane, safe and efficient way](http://stackoverflow.com/questions/10195343/copy-a-file-in-a-sane-safe-and-efficient-way) — jamek, Feb 10 '17 at 10:24
when using `open()` the code should also contain a call to `close()` — user3629249, Feb 12 '17 at 18:23
the posted code is missing the needed `#include` statements. — user3629249, Feb 12 '17 at 18:24
this line: `off_t file_length = (int)(fseek(movie_rdfd, 0, SEEK_END));` is not correct: 1) the function `fseek()` returns an `int`, so there is no need to cast to `int`; 2) the returned value is an `int`, so why try to force it into an `off_t` variable? — user3629249, Feb 12 '17 at 18:27
this line: `char* Save[fseek(movie_rdfd, 0, SEEK_END)];` is declaring a (VERY) large array of pointers to char. What you really want is an array of char, not an array of pointers to char. Note: a pointer to char is (with a 32 bit architecture) 4 bytes — user3629249, Feb 12 '17 at 18:30

score 4 · Answer 1 · answered Feb 10 '17 at 10:47

The proper logic for copying a file in POSIX.1 systems, including Linux, is roughly

Open source file
Open target file
Repeat:
    Read a chunk of data from source
    Write that chunk to target
Until no more data to read
Close source file
Close target file

Proper error handling adds a significant amount of code, but I consider it a necessity, not an optional thing to be added afterwards if one has the time to do so.

(I am so strict in this, that I'd fail anyone who omits error checking, even if their program otherwise functioned properly. The reason is basic sanity: A tool that may blow up in your hands is not a tool, it is a bomb. There are enough bombs in the software world already, and we don't need more "programmers" who create those. What we need are reliable tools.)

Here is an example implementation with proper error checking:

#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#define  DEFAULT_CHUNK  262144  /* 256k */

int copy_file(const char *target, const char *source, const size_t chunk)
{
    const size_t size = (chunk > 0) ? chunk : DEFAULT_CHUNK;
    char        *data, *ptr, *end;
    ssize_t      bytes;
    int          ifd, ofd, err;

    /* NULL and empty file names are invalid. */
    if (!target || !*target || !source || !*source)
        return EINVAL;

    ifd = open(source, O_RDONLY);
    if (ifd == -1)
        return errno;

    /* Create output file; fail if it exists (O_EXCL): */
    ofd = open(target, O_WRONLY | O_CREAT | O_EXCL, 0666);
    if (ofd == -1) {
        err = errno;
        close(ifd);
        return err;
    }

    /* Allocate temporary data buffer. */
    data = malloc(size);
    if (!data) {
        close(ifd);
        close(ofd);
        /* Remove output file. */
        unlink(target);
        return ENOMEM;
    }

    /* Copy loop. */
    while (1) {

        /* Read a new chunk. */
        bytes = read(ifd, data, size);
        if (bytes < 0) {
            if (bytes == -1)
                err = errno;
            else
                err = EIO;
            free(data);
            close(ifd);
            close(ofd);
            unlink(target);
            return err;
        } else
        if (bytes == 0)
            break;

        /* Write that same chunk. */
        ptr = data;
        end = data + bytes;
        while (ptr < end) {

            bytes = write(ofd, ptr, (size_t)(end - ptr));
            if (bytes <= 0) {
                if (bytes == -1)
                    err = errno;
                else
                    err = EIO;
                free(data);
                close(ifd);
                close(ofd);
                unlink(target);
                return err;
            } else
                ptr += bytes;
        }
    }

    free(data);

    err = 0;
    if (close(ifd))
        err = EIO;
    if (close(ofd))
        err = EIO;
    if (err) {
        unlink(target);
        return err;
    }

    return 0;
}

The function takes the target file name (to be created), source file name (to be read from), and optionally the preferred chunk size. If you supply 0, the default chunk size is used. On current Linux hardware, 256k chunk size should reach maximum throughput; smaller chunk size may lead to slower copy operation on some (big and fast) systems.

The chunk size should be a power of two, or a small multiple of a large power of two. Because the chunk size is chosen by the caller, the buffer is dynamically allocated using malloc()/free(). Note that it is explicitly freed in error cases. (It is a common bug to forget to free dynamically allocated data in the error path; this is often called "leaking memory".)

Because the target file is always created (the function fails with EEXIST if the target file already exists), it is removed ("unlinked") if an error occurs, so that no partial file is left over in error cases.

The exact usage of open(), read(), write(), close(), and unlink() can be found in the Linux man pages.

write() returns the number of bytes written, or -1 if an error occurs. (Note that I explicitly treat 0 and all negative values smaller than -1 as I/O errors, because they should not normally occur.)

read() returns the number of bytes read, -1 if an error occurs, or 0 if there is no more data.

Both read() and write() may return a short count; i.e., less than was requested. (In Linux, this does not happen for normal files on most local filesystems, but only an idiot relies on the above function to be used on such files. Handling short counts isn't that complex, as you can see from the above code.)

If you wanted to add a progress meter using a callback function, for example

void progress(const char *target, const char *source,
              const off_t completed, const off_t total);

then it would make sense to add an fstat(ifd, &info) call before the loop (with struct stat info; and off_t copied;, the latter counting the number of bytes copied). That call too may fail or report info.st_size == 0, if the source is e.g. a named pipe instead of a normal file. This means that the total parameter might be zero, in which case the progress meter would display only the progress in bytes (completed), with the remaining amount unknown.
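
A minimal sketch of how that could look, assuming the copy loop is split into its own helper that works on already-open descriptors. The names copy_with_progress, progress_fn and print_progress are hypothetical (they are not part of the function above), and the 64k stack buffer is just a simplification of the malloc()'d chunk:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

/* Callback type matching the progress() prototype above. */
typedef void (*progress_fn)(const char *target, const char *source,
                            const off_t completed, const off_t total);

/* Hypothetical example callback: print byte counts to standard error. */
void print_progress(const char *target, const char *source,
                    const off_t completed, const off_t total)
{
    if (total > 0)
        fprintf(stderr, "%s -> %s: %lld of %lld bytes\n", source, target,
                (long long)completed, (long long)total);
    else
        fprintf(stderr, "%s -> %s: %lld bytes\n", source, target,
                (long long)completed);
}

/* Hypothetical helper: copy everything from ifd to ofd, calling report
   (if not NULL) after each chunk. Returns 0 on success, else an errno
   value. The names are only passed through to the callback. */
int copy_with_progress(int ofd, int ifd,
                       const char *target, const char *source,
                       progress_fn report)
{
    char         data[65536];
    struct stat  info;
    off_t        total = 0, copied = 0;
    ssize_t      bytes;

    /* total stays 0 ("unknown") if fstat() fails or the source
       is not a regular file (for example, a pipe). */
    if (fstat(ifd, &info) == 0 && S_ISREG(info.st_mode))
        total = info.st_size;

    while (1) {
        char *ptr, *end;

        bytes = read(ifd, data, sizeof data);
        if (bytes == 0)
            break;                          /* End of input. */
        if (bytes < 0)
            return (bytes == -1) ? errno : EIO;

        /* Write the chunk, handling short writes. */
        ptr = data;
        end = data + bytes;
        while (ptr < end) {
            bytes = write(ofd, ptr, (size_t)(end - ptr));
            if (bytes <= 0)
                return (bytes == -1) ? errno : EIO;
            ptr += bytes;
            copied += bytes;
        }

        if (report)
            report(target, source, copied, total);
    }

    return 0;
}
```

Unlike copy_file(), this sketch leaves opening, closing, and unlinking to the caller, so the cleanup logic shown earlier would still be needed around it.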

Linux does have a maximum IO size that, when exceeded, will result in a partial `read()`/`write()` even on regular files on any file system. IIRC, it's just a bit short of 2 GB, but I can't find a reference right now. So if you try to `write()` a 4GB buffer, the call will return something just short of 2GB. — Andrew Henle, Feb 10 '17 at 11:07
@AndrewHenle: I know, it's one page short of 2**31 bytes in all Linux architectures. I filed a bug (814296), a duplicate of RHSA-2012:0862, at the Red Hat Bugzilla in 2012, and the limit (kernel `MAX_RW_COUNT` defined in `include/linux/fs.h`) was added to the VFS layer in the Linux kernel `fs/read_write.c:vfs_write()` due to bugs like that. — Nominal Animal, Feb 10 '17 at 11:26
@NominalAnimal: the code can be shortened a bit if you isolate the error recovery calls (free, close, close, unlink, return) at the end of the function and put cascading goto labels in front of each operation. — Blagovest Buyukliev, Feb 10 '17 at 12:30
@BlagovestBuyukliev: Similar to error handling in the Linux kernel, yes. In this particular case, I think it is better to keep all the cleanup with the error checking code, for ease of understanding. — Nominal Animal, Feb 10 '17 at 14:21

score 3 · Answer 2 · answered Feb 10 '17 at 10:11

Here are some critiques, and then how I'd do it:

This is good:

int movie_rdfd = open("Suits.mp4", O_RDONLY); //fd for read

This is, ummm, not so good:

off_t file_length = (int)(fseek(movie_rdfd, 0, SEEK_END));

fseek() is for stdio-based FILE * pointers opened with fopen(), not int file descriptors from open. To get the size of a file opened with open(), use fstat():

struct stat sb;
int rc = fstat( movie_rdfd, &sb );
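
On success (rc == 0), the size in bytes is in sb.st_size. A slightly fuller sketch with the return values checked might look like this; the helper name file_size is hypothetical, and it opens the file itself only so that the example is self-contained:

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical helper: store the size of the file at 'path' in *size.
   Returns 0 on success, or an errno value on failure. */
int file_size( const char *path, off_t *size )
{
    struct stat sb;
    int fd, err;

    fd = open( path, O_RDONLY );
    if ( fd == -1 )
    {
        return errno;
    }

    if ( fstat( fd, &sb ) == -1 )
    {
        err = errno;
        close( fd );
        return err;
    }

    close( fd );
    *size = sb.st_size;   /* meaningful for regular files only */
    return 0;
}
```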

Now you know how big the file is. But if it's a really big file, it's not going to fit into memory, so this is bad:

char* Save[fseek(movie_rdfd, 0, SEEK_END)];

That's bad in multiple ways, too - it should be char Save[], not char *. But either way, for a really large file, it's not going to work - it's too big to put on the stack as a local variable.

And you don't want to read the whole thing at once, anyway - it likely won't work as you'll likely get a partial read. Per the read standard:

The read() function shall attempt to read nbyte bytes from the file associated with the open file descriptor ...

RETURN VALUE

Upon successful completion, these functions shall return a non-negative integer indicating the number of bytes actually read. ...

Note that it says "shall attempt to read" and it returns "the number of bytes actually read". So you have to handle partial reads with a loop anyway.

Here's one really simple way to copy a file using open(), read(), and write() (note that it really should have some more error checking - for example, the write() results should be checked to be sure they match the number of bytes read):

#define BUFSIZE ( 32UL * 1024UL )

char buffer[ BUFSIZE ];
int in = open( nameOfInputFile, O_RDONLY );
int out = open( nameOfOutputFile, O_WRONLY | O_CREAT | O_TRUNC, 0644 );

// break loop explicitly when read fails or hits EOF
for ( ;; )
{
    ssize_t bytesRead = read( in, buffer, sizeof( buffer ) );
    if ( bytesRead <= 0 )
    {
        break;
    }

    write( out, buffer, bytesRead );
}

Note that you don't even need to know how big the file is.
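
For reference, here is one possible sketch of the extra checking mentioned above: the same loop, but with partial writes handled and read errors distinguished from EOF. The names write_all and copy_fd are hypothetical, and retrying on EINTR is an assumption about the desired behaviour, not a requirement:

```c
#include <unistd.h>
#include <errno.h>

#define BUFSIZE ( 32UL * 1024UL )

/* Hypothetical helper: write all 'len' bytes, looping over partial writes.
   Returns 0 on success, -1 on error (with errno set). */
static int write_all( int fd, const char *buf, size_t len )
{
    while ( len > 0 )
    {
        ssize_t n = write( fd, buf, len );
        if ( n < 0 )
        {
            if ( errno == EINTR )
            {
                continue;   // interrupted before writing anything; retry
            }
            return -1;
        }
        if ( n == 0 )
        {
            errno = EIO;    // no progress; treat as an I/O error
            return -1;
        }
        buf += n;
        len -= ( size_t ) n;
    }
    return 0;
}

/* Hypothetical wrapper: copy everything from 'in' to 'out'.
   Returns 0 on success, -1 on a read or write error. */
int copy_fd( int in, int out )
{
    char buffer[ BUFSIZE ];

    for ( ;; )
    {
        ssize_t bytesRead = read( in, buffer, sizeof( buffer ) );
        if ( bytesRead == 0 )
        {
            return 0;       // EOF: everything was copied
        }
        if ( bytesRead < 0 )
        {
            if ( errno == EINTR )
            {
                continue;   // interrupted; retry the read
            }
            return -1;      // real read error, not EOF
        }
        if ( write_all( out, buffer, ( size_t ) bytesRead ) != 0 )
        {
            return -1;
        }
    }
}
```

The caller would still need to check both open() calls and the close() of the output descriptor.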

There are a lot of things you can do to make this a bit faster - they're usually not worth it as the above code will likely run at about 90% of maximum IO rate on most systems.

This is not entirely safe because `read(2)` can fail in non-EOF cases, too. Also, the result of `write` must be checked, too. — Blagovest Buyukliev, Feb 10 '17 at 10:33
@BlagovestBuyukliev *This is not entirely safe because `read(2)` can fail in non-EOF cases, too.* And when `read()` fails in non-EOF cases, what does it return? — Andrew Henle, Feb 10 '17 at 10:45
It still returns -1, but then, you have not correctly copied the file. Common non-EOF errors are `EINTR` and `EISDIR` (the file gets opened, but if it's a directory it fails on the first read). — Blagovest Buyukliev, Feb 10 '17 at 10:47
@BlagovestBuyukliev It's not meant to be a complete solution. I'd rather have the questioner learn something by doing it himself. — Andrew Henle, Feb 10 '17 at 10:52
