
I'm looking to overwrite the contents of a file that already exists. Obviously I could just create a new file that overwrites the old one, as per this answer. However, I'm writing a program that is going to be doing this potentially quite a few times, and I want to cut down unnecessary overhead as much as possible.

So my question is: is there a better way of simply rewriting the contents of the file itself, instead of writing a 'new' file that replaces the old one? Or is the overhead of simply overwriting the entire file contents roughly equivalent to the overhead of creating the new file and then writing to it? (For the record, these files are only 1 KB large.)

MattS
  • You do understand how files are stored on disk, and how this limits how they can be handled, correct? – Hovercraft Full Of Eels Jun 28 '12 at 22:35
  • I'm not sure I understand what you're asking. I have a degree of understanding, but I wouldn't say I'm an expert on on-disk data storage. I'm a computer science student (this isn't homework, though) between my second and third years, so we've covered file storage a bit, but not a ton in my classes. – MattS Jun 28 '12 at 22:38
  • How about using a database or key-value store like [Redis](http://redis.io/) instead of files? – Christopher Peisert Jun 28 '12 at 22:39
  • @cpeisert Unfortunately I'm a bit limited in what I can do - my program invokes another program that takes a certain kind of formatted file as input. So I need to be writing a set of files, invoking this second program on each file, getting the results, changing the file contents, and doing it again; it's part of a genetic algorithm. However, I'm unable to change the other program at all, so I basically have to treat it like a black box. – MattS Jun 28 '12 at 22:42
  • You're worried about kilobyte-sized files? Just benchmark the easiest option to implement and see if that's good enough… – Donal Fellows Jun 28 '12 at 22:45
  • @Donal: When you've got a few million of them, it might make sense to be worried... remember how much effort went into optimizing filesystems and mount options for NNTP servers? :) – sarnold Jun 28 '12 at 22:50

3 Answers


The short answer: write both and profile.

The longer answer with considerable hand-waving:

Overwriting a file will involve the following system calls:

open
write
close

Creating a new file, deleting the old file, and renaming the new file will involve the following system calls (both approaches are sketched in Java after this list):

open
write
close
unlink
rename
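
For concreteness, here's a minimal sketch of both approaches in Java, assuming the java.nio.file API (Java 7+); the class, method, and file names are placeholders for illustration, not anything from the question:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    public class OverwriteVsRename {
        // Approach 1: overwrite in place (open, write, close)
        static void overwriteInPlace(Path target, byte[] contents) throws IOException {
            // TRUNCATE_EXISTING rewrites the existing file without unlinking it
            Files.write(target, contents,
                    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        }

        // Approach 2: write a temporary file, then rename it over the original
        static void writeAndRename(Path target, byte[] contents) throws IOException {
            Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
            Files.write(tmp, contents);  // open, write, close
            // On POSIX systems this maps to rename(2), which atomically
            // replaces the old file, so no separate unlink call is needed there
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        }
    }

Either could be called once per file per generation of the genetic algorithm; profiling both, as suggested above, is the real test.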

System calls are often the slowest part of programs; in general, reducing system calls is a good way to speed up a program. Overwriting the one file will reuse the operating system's internal directory entry data, which will probably also lead to some speed improvements. (These may be difficult to measure in a language with VM overhead...)

Your files are small enough that each write() should be handled atomically, assuming you're updating the entire 1K in a single write. (Since you care about performance, this seems like a safe assumption.) This means that other processes should not see partial writes, except in the case of catastrophic power failures and lossy mount options. (Not common.) The file-rename approach does give consistent files even in the face of multiple writes.
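
To illustrate the single-write point, here's a hedged sketch (the SingleWrite class and writeWhole helper are made up for this example) that pushes the whole payload through one write call on a FileChannel:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class SingleWrite {
        // Hypothetical helper: write the entire 1K payload in one call
        static void writeWhole(Path path, byte[] payload) throws IOException {
            FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.WRITE, StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING);
            try {
                ByteBuffer buf = ByteBuffer.wrap(payload);
                // for a 1K buffer this loop usually completes in a single write()
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
            } finally {
                ch.close();
            }
        }
    }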

However, 1K files are a pretty inefficient storage mechanism; many filesystems allocate file data in 4K blocks. If these data blocks exist only in your application, it might make sense to write several of them at a time into containers of some sort. (Quake-derived systems do this for reading their maps, textures, and so forth out of zip files, because giant streaming I/O requests are far faster than thousands of smaller I/O requests.) Of course, this is harder if your application is writing these files for other applications to work with, but it might still be worth investigating if the files are rarely shared.
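
If the container idea is viable, a rough sketch using the JDK's zip classes (the RecordPacker class and packRecords method are made-up names for illustration) could look like this:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Map;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class RecordPacker {
        // Pack many small records into a single container file
        static void packRecords(String zipName, Map<String, byte[]> records)
                throws IOException {
            ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipName));
            try {
                for (Map.Entry<String, byte[]> entry : records.entrySet()) {
                    zos.putNextEntry(new ZipEntry(entry.getKey())); // one entry per record
                    zos.write(entry.getValue());
                    zos.closeEntry();
                }
            } finally {
                zos.close();
            }
        }
    }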

sarnold

You can use RandomAccessFile; here is a short sample:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class RandomAccessFileDemo {
        public static void main(String[] args) throws IOException {
            // open (or create) test.txt in read-write mode
            RandomAccessFile raf = new RandomAccessFile("c:/test.txt", "rw");

            // write a string to the file (writeUTF prefixes a 2-byte length)
            raf.writeUTF("Hello World");

            // move the file pointer back to position 0
            raf.seek(0);

            // read the string back and print it
            System.out.println(raf.readUTF());

            // print the current file length
            System.out.println(raf.length());

            // truncate (or extend) the file to 30 bytes
            raf.setLength(30);

            // print the new length
            System.out.println(raf.length());

            // release the file handle
            raf.close();
        }
    }
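
If you only need to change part of the file, here's a hedged sketch of an in-place update (the byte offset, field width, and PartialUpdate class are assumptions for illustration, not details from the question):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class PartialUpdate {
        public static void main(String[] args) throws IOException {
            // Hypothetical: overwrite one fixed-width field in place, without
            // rewriting the rest of the file (offset and width are assumptions)
            RandomAccessFile raf = new RandomAccessFile("c:/test.txt", "rw");
            raf.seek(16);                               // byte offset of the field
            raf.writeBytes(String.format("%08d", 42));  // same width as old value
            raf.close();
        }
    }
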
ialiashkevich
  • But only if you're accessing data in a random-access way, and only if you're replacing precise lengths of data with data of the exact same lengths. – Hovercraft Full Of Eels Jun 28 '12 at 23:00
  • @ialiashkevich Awesome, this is exactly the alternative I was looking for. The files are always of roughly the same length (they're basically just lines with numbers in them in a certain pattern), so this seems like it could work. But now, the original question: what has less overhead - your method, or the method of the answer I linked to in the original post? – MattS Jun 29 '12 at 00:53
  • If you rewrite just part of the file, then RandomAccessFile has less overhead. Rewriting the whole file will perform the same as the approach in the answer linked in the original post.
    Since your program invokes another program that takes a certain kind of formatted file as input, you shouldn't worry about file-writing overhead; invoking that other program will consume most of your system's resources.
    I would recommend writing the files and invoking the program simultaneously in multiple threads; that way you can get maximum performance out of the hardware.
    – ialiashkevich Jun 29 '12 at 23:13

Just do it as shown in the linked answer. Let the OS/filesystem worry about unlinking/linking inodes, locations on disk, and so on. These days there's rarely a good reason to worry about that for the vast majority of software development.

In general, there won't be much overhead that isn't eclipsed by CPU and disk I/O. If you're concerned about disk I/O, use a memory filesystem (provided you don't need the files to survive a crash) or a very fast SSD on SATA3.

Drizzt321