3

In my code, I have a situation where I need to copy data from one file to another. The solution I came up with looks like this:

const int BUF_SIZE = 1024;
char buf[BUF_SIZE];

int left_to_copy = toCopy;
while(left_to_copy > BUF_SIZE)
{
    fread(buf, BUF_SIZE, 1, fin);
    fwrite(buf, BUF_SIZE, 1, fout);
    left_to_copy -= BUF_SIZE;
}

fread(buf, left_to_copy, 1, fin);
fwrite(buf, left_to_copy, 1, fout);

My main thought was that there might be something like memcpy, but for data in files. I just give it two file streams and the total number of bytes. I searched a bit but I couldn't find any such thing.

But if something like that isn't available, what buffer size should I use to make the transfer fastest? Bigger would mean fewer system calls, but I figured it could mess up other buffering or caching on the system. Should I dynamically allocate the buffer so it only takes one pair of read/write calls? Typical transfer sizes in this particular case are from a few KB to a dozen or so MB.

EDIT: For OS specific information, we're using Linux.

EDIT2:

I tried using sendfile, but it didn't work. It seemed to write the right amount of data, but it was garbage.

I replaced my example above with something that looks like this:

fflush(fin);
fflush(fout);
off_t offset = ftello64(fin);
sendfile(fileno(fout), fileno(fin), &offset, toCopy);
fseeko64(fin, offset, SEEK_SET);

I added the flush, offset, and seek calls one at a time, since it didn't appear to be working.

stands2reason
  • It certainly seems like the fastest way to do this is going to involve OS dependent APIs. –  May 10 '12 at 22:56
  • What's wrong with the simple `ifstream i; ofstream o; /*open both*/; o << i.rdbuf();`? Guaranteed portability isn't something to ignore... – ildjarn May 10 '12 at 22:58
  • fread() and fwrite() have return values. You should check/use them. – Ras May 10 '12 at 23:04
  • @ildjarn. Sorry if it wasn't clear from the example. In the program, fin and fout are both already positioned to copy a specific chunk of data of `toCopy` bytes. – stands2reason May 11 '12 at 03:02
  • That's not relevant to my point. Do you mean to say that you want a _partial_ file copy rather than a full file copy? – ildjarn May 11 '12 at 03:04
  • To add to @Ras point, what if the return value of `fread()` is less than `BUF_SIZE`? You would be writing rubbish bytes to `fout` in that case. Also, your code effectively rounds up the `toCopy` size to the next multiple of `BUF_SIZE`. Depending on your use, this may or may not be a problem for you. – Greg Hewgill May 11 '12 at 05:21
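The points raised in these comments (use the return values of `fread()` and `fwrite()`, and never write more bytes than were actually read) can be sketched roughly as follows. This is only an illustration, not the asker's code: the function name `copy_stream` and the 4096-byte buffer are assumptions made for the example.

```c
#include <stdio.h>

/* Hypothetical helper: copy up to to_copy bytes from fin to fout, both
 * already positioned. Returns the number of bytes actually copied; a short
 * count means EOF or an I/O error (check feof()/ferror() on the streams). */
size_t copy_stream(FILE *fin, FILE *fout, size_t to_copy)
{
    char buf[4096];            /* assumed size; tune as needed */
    size_t copied = 0;

    while (copied < to_copy) {
        size_t want = to_copy - copied;
        if (want > sizeof buf)
            want = sizeof buf;

        /* With an element size of 1, the return value is a byte count. */
        size_t got = fread(buf, 1, want, fin);
        if (got == 0)
            break;             /* EOF or read error */

        /* Write only what was really read, never the full buffer. */
        size_t put = fwrite(buf, 1, got, fout);
        copied += put;
        if (put < got)
            break;             /* write error */
    }
    return copied;
}
```

A caller would compare the result against the requested size and treat a mismatch as a failed or truncated copy.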

4 Answers

12

You need to tell us your (desired) OS. The appropriate calls (or rather best fitting calls) will be very system specific.

In Linux/*BSD/Mac you would use sendfile(2), which handles the copying in kernel space.

SYNOPSIS

 #include <sys/sendfile.h>
 ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

DESCRIPTION

sendfile() copies data between one file descriptor and another.  Because this
copying is done within the kernel, sendfile() is more efficient than the
combination of read(2) and write(2), which would require transferring data to
and from user space.

in_fd should be a file descriptor opened for reading and out_fd should be a
descriptor opened for writing.
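For a plain file-to-file copy on Linux, a call along these lines might be enough. It is a minimal sketch under some assumptions: the wrapper name `sendfile_all` is made up for this example, both arguments are raw file descriptors rather than `FILE *` streams (so no stdio buffering is involved), and the kernel is 2.6.33 or newer, since older Linux kernels required `out_fd` to be a socket. The loop is there because `sendfile()` may transfer fewer bytes than requested.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Hypothetical wrapper: copy count bytes from in_fd, starting at *offset,
 * to the current file position of out_fd. *offset is advanced by the kernel;
 * the file position of in_fd itself is left untouched.
 * Returns 0 on success, -1 on error or if EOF is hit before count bytes. */
int sendfile_all(int out_fd, int in_fd, off_t *offset, size_t count)
{
    while (count > 0) {
        ssize_t n = sendfile(out_fd, in_fd, offset, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted, retry */
            return -1;
        }
        if (n == 0)
            return -1;         /* premature end of input */
        count -= (size_t)n;
    }
    return 0;
}
```

Passing a non-NULL `offset` keeps the read position explicit, which is convenient when only a chunk in the middle of the input has to be copied.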

Further reading:

Server portion of sendfile example:

/*

Server portion of sendfile example.

usage: server [port]

Copyright (C) 2003 Jeff Tranter.


This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

*/


#include <limits.h>   /* PATH_MAX */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <netinet/in.h>


int main(int argc, char **argv)
{
  int port = 1234;           /* port number to use */
  int sock;                  /* socket descriptor */
  int desc;                  /* file descriptor for socket */
  int fd;                    /* file descriptor for file to send */
  struct sockaddr_in addr;   /* socket parameters for bind */
  struct sockaddr_in addr1;  /* socket parameters for accept */
  socklen_t addrlen = sizeof(addr1);  /* in/out argument to accept */
  struct stat stat_buf;      /* argument to fstat */
  off_t offset = 0;          /* file offset */
  char filename[PATH_MAX];   /* filename to send */
  int rc;                    /* holds return code of system calls */

  /* check command line arguments, handling an optional port number */
  if (argc == 2) {
    port = atoi(argv[1]);
    if (port <= 0) {
      fprintf(stderr, "invalid port: %s\n", argv[1]);
      exit(1);
    }
  } else if (argc != 1) {
    fprintf(stderr, "usage: %s [port]\n", argv[0]);
    exit(1);
  }

  /* create Internet domain socket */
  sock = socket(AF_INET, SOCK_STREAM, 0);
  if (sock == -1) {
    fprintf(stderr, "unable to create socket: %s\n", strerror(errno));
    exit(1);
  }

  /* fill in socket structure */
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(port);

  /* bind socket to the port */
  rc =  bind(sock, (struct sockaddr *)&addr, sizeof(addr));
  if (rc == -1) {
    fprintf(stderr, "unable to bind to socket: %s\n", strerror(errno));
    exit(1);
  }

  /* listen for clients on the socket */
  rc = listen(sock, 1);
  if (rc == -1) {
    fprintf(stderr, "listen failed: %s\n", strerror(errno));
    exit(1);
  }

  while (1) {

    /* wait for a client to connect; addrlen must hold the size of addr1 on entry */
    addrlen = sizeof(addr1);
    desc = accept(sock, (struct sockaddr *)  &addr1, &addrlen);
    if (desc == -1) {
      fprintf(stderr, "accept failed: %s\n", strerror(errno));
      exit(1);
    }

    /* get the file name from the client */
    rc = recv(desc, filename, sizeof(filename) - 1, 0);  /* leave room for '\0' */
    if (rc == -1) {
      fprintf(stderr, "recv failed: %s\n", strerror(errno));
      exit(1);
    }

    /* null terminate and strip any \r and \n from filename */
    filename[rc] = '\0';
    if (filename[0] != '\0' && filename[strlen(filename)-1] == '\n')
      filename[strlen(filename)-1] = '\0';
    if (filename[0] != '\0' && filename[strlen(filename)-1] == '\r')
      filename[strlen(filename)-1] = '\0';

    /* exit server if filename is "quit" */
    if (strcmp(filename, "quit") == 0) {
      fprintf(stderr, "quit command received, shutting down server\n");
      break;
    }

    fprintf(stderr, "received request to send file %s\n", filename);

    /* open the file to be sent */
    fd = open(filename, O_RDONLY);
    if (fd == -1) {
      fprintf(stderr, "unable to open '%s': %s\n", filename, strerror(errno));
      exit(1);
    }

    /* get the size of the file to be sent */
    fstat(fd, &stat_buf);

    /* copy file using sendfile */
    offset = 0;
    rc = sendfile (desc, fd, &offset, stat_buf.st_size);
    if (rc == -1) {
      fprintf(stderr, "error from sendfile: %s\n", strerror(errno));
      exit(1);
    }
    if (rc != stat_buf.st_size) {
      fprintf(stderr, "incomplete transfer from sendfile: %d of %d bytes\n",
              rc,
              (int)stat_buf.st_size);
      exit(1);
    }

    /* close descriptor for file that was sent */
    close(fd);

    /* close socket descriptor */
    close(desc);
  }

  /* close socket */
  close(sock);
  return 0;
}
Kijewski
  • For Linux, this sounds like exactly what I'm looking for. I can try using it next week and see how it works. – stands2reason May 11 '12 at 02:57
  • Did you try sendfile? There is practically no way a read/write loop can be nearly as fast as sendfile. Google "sendfile" and see how many widely used systems are working on replacing their read/write loops with sendfile … – Kijewski May 15 '12 at 08:08
  • @kay, I would appreciate the example you showed in the third link, but it returns "Page not found". Could you provide a new link containing the same example? Thank you! – silvioprog Apr 26 '19 at 05:10
  • @silvioprog, thank you for notifying me! I copied the code verbatim from https://web.archive.org/ – Kijewski Apr 26 '19 at 12:18
2

One thing you could do is increase the size of your buffer. That could help if you have large files.

Another thing is to call directly to the OS, whatever that may be in your case. There is some overhead in fread() and fwrite().

If you could use unbuffered routines and provide your own larger buffer, you may see some noticeable performance improvements.

I'd recommend getting the number of bytes written from the return value from fread() to track when you're done.
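As a rough illustration of the "unbuffered routines plus your own larger buffer" idea on a POSIX system, something like the following could be used. The name `copy_fd` and its parameters are assumptions for this sketch; the buffer size is left to the caller so that different sizes can be timed against each other.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: copy up to count bytes from in_fd to out_fd using
 * read()/write() and a heap buffer of buf_size bytes.
 * Returns the number of bytes copied (short on EOF), or -1 on error. */
ssize_t copy_fd(int out_fd, int in_fd, size_t count, size_t buf_size)
{
    char *buf = malloc(buf_size);
    if (buf == NULL)
        return -1;

    size_t copied = 0;
    while (copied < count) {
        size_t want = count - copied;
        if (want > buf_size)
            want = buf_size;

        ssize_t got = read(in_fd, buf, want);
        if (got < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            free(buf);
            return -1;
        }
        if (got == 0)
            break;                      /* end of input */

        /* write() may also be partial, so drain the buffer in a loop. */
        ssize_t off = 0;
        while (off < got) {
            ssize_t put = write(out_fd, buf + off, (size_t)(got - off));
            if (put < 0) {
                if (errno == EINTR)
                    continue;
                free(buf);
                return -1;
            }
            off += put;
        }
        copied += (size_t)got;
    }
    free(buf);
    return (ssize_t)copied;
}
```

Which `buf_size` works best depends on the machine and the files involved, so it is worth measuring a few values rather than assuming one.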

Jonathan Wood
  • I will be trying a larger buffer as well. Is there a recommended size for shoveling data around like this? Also, I don't see how you can use "the number of bytes written from the return value" to help. My case uses a loop while the amount remaining is greater than the size of the buffer. If I gave it a larger number that would just overflow the buffer. – stands2reason May 11 '12 at 02:59
  • The best size for the buffer depends on how much memory is available and the size of the files being copied. You'll need to play with it. I've allocated as much as a half meg in the past (dynamically allocated, of course). You could simply subtract the value returned by `fread()` from number of bytes to copy. Even easier, you could just loop until `fread()` returns less than `BUF_SIZE`. In that case, you wouldn't even need to determine the size of the file. It doesn't make much difference but it just looked odd to me. – Jonathan Wood May 11 '12 at 04:25
  • OK, so I did the profiling, and I found that bigger was better: I used the time command and found 1m31 for the 1K buffer, 59s for a 1M buffer, and 26s for a 12M buffer (dynamically allocated). It looks like making the buffer about as big as the biggest thing you plan on copying is fastest, at least if the buffer size relative to amount of RAM is such that swapping won't be an issue. – stands2reason May 14 '12 at 14:15
2

It might be worthwhile considering memory-mapped file I/O for your target operating system. For the file sizes you are talking about, this is a viable way to go, and the OS will optimize better than you can do. If you want to write portable-OS code, this may not be the best approach, though.

This will require some setting up, but once you have got it set up, you can forget about loop code & it will basically look like a memcpy.
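To give an idea of what that setup might look like on a POSIX system, here is a minimal sketch that copies a whole file through two mappings. The name `mmap_copy` is hypothetical, and the sketch assumes `out_fd` was opened for both reading and writing, which a shared writable mapping requires. Copying only a chunk would work the same way, except that the offset passed to `mmap()` has to be a multiple of the page size.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the entire contents of in_fd into out_fd by
 * mapping both files and calling memcpy(). out_fd must be open O_RDWR.
 * Returns 0 on success, -1 on error. */
int mmap_copy(int out_fd, int in_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) == -1)
        return -1;

    /* The destination must already have its final size before mapping. */
    if (ftruncate(out_fd, st.st_size) == -1)
        return -1;

    size_t len = (size_t)st.st_size;
    if (len == 0)
        return 0;                       /* mmap() rejects zero-length maps */

    void *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in_fd, 0);
    if (src == MAP_FAILED)
        return -1;

    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, out_fd, 0);
    if (dst == MAP_FAILED) {
        munmap(src, len);
        return -1;
    }

    memcpy(dst, src, len);              /* the actual copy */

    int rc = 0;
    if (munmap(dst, len) == -1)
        rc = -1;
    if (munmap(src, len) == -1)
        rc = -1;
    return rc;
}
```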

George D Girton
0

As far as fast reading is concerned, I think you can also opt for mapping the files: memory-mapped I/O using mmap (see the manual page for mmap). It is considered to be more efficient than conventional I/O, especially when dealing with large files.

mmap doesn't actually read the file. It just maps it into the address space. That's why it's so fast: there is no disk I/O until you actually access that region of the address space.

Or you can look up the block size first and read in chunks of that size; that is also considered efficient, because the compiler enhances the optimization in that case.
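One way to look up a block size on a POSIX system is the `st_blksize` field filled in by `fstat()`, which reports the preferred I/O block size for the file. The sketch below assumes that this is the block size meant here; the helper name `preferred_block_size` and its `fallback` parameter are made up for the example.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: return the preferred I/O block size reported for fd,
 * or fallback if it cannot be determined. */
size_t preferred_block_size(int fd, size_t fallback)
{
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_blksize <= 0)
        return fallback;
    return (size_t)st.st_blksize;
}
```

The result could then be used as the buffer size, or as a unit for a larger buffer, in a read/write loop.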

john