I have an app that updates a disk file, and I want to make sure, as far as possible, that the previous version of the file doesn't get corrupted.

The most straightforward way to update a file, of course, is to simply write:

(spit "myfile.txt" mystring)

However, if the PC (or the Java process) dies in the middle of the write, there is a small chance that the file ends up corrupted.

A better solution is probably to write:

(do (spit "tempfile" mystring)
    (.rename (file "tempfile") "myfile.txt")
    (delete-file "tempfile"))

This uses the Java file rename function, which I gather is typically atomic when performed within a single storage device.

Do any Clojurians with some deeper knowledge of Clojure file IO have any advice on whether this is the best approach, or if there's a better way to minimize the risk of file corruption when updating a disk file?

Thanks!

drcode
  • Are you asking for a more idiomatic way to do the same tempfile-rename-delete path, or a more robust approach to maintaining consistency of file structures? – Alex Mar 04 '13 at 19:23
  • When it gets right down to it, I'm looking for the best way to maintain consistency of file structures (I'm building a small file-based database for a project and want to make sure I do file I/O correctly.) – drcode Mar 04 '13 at 19:55
  • Why delete tempfile? Assuming that `.rename` is equivalent to `mv` the source file should no longer exist. Also shouldn't that be `renameTo`? I don't see `rename()` in `java.io.File`. – Alex Jasmin Mar 05 '13 at 03:00

3 Answers

This is not specific to Clojure; a temp-rename-delete scenario does not guarantee an atomic replace under the POSIX standard. This is due to the possibility of write reordering - the rename might get to the physical disk before the temp writes do, so when a power failure happens within this time window, data loss happens. This is not a purely theoretical possibility:

http://en.wikipedia.org/wiki/Ext4#Delayed_allocation_and_potential_data_loss

You need an fsync() after writing the temp file. This question discusses calling fsync() from Java.
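
As a rough illustration of the write, fsync, rename sequence, here is a minimal sketch in Clojure. It assumes Java 7 or later (for `java.nio.file.Files`); `write-file-durably!` and the `.tmp` suffix are hypothetical names chosen for the example, not an existing API. `FileDescriptor.sync()` is the JVM's way of requesting an fsync.

```clojure
(import '(java.io File FileOutputStream)
        '(java.nio.charset StandardCharsets)
        '(java.nio.file Files CopyOption StandardCopyOption))

(defn write-file-durably!
  "Hypothetical helper: writes content to a temp file next to path,
  forces it to disk, then renames it over path."
  [^String path ^String content]
  (let [target (.getAbsoluteFile (File. path))
        ;; Keep the temp file in the same directory so the rename stays
        ;; on one filesystem.
        tmp    (File. (.getParentFile target) (str (.getName target) ".tmp"))]
    (with-open [out (FileOutputStream. tmp)]
      (.write out (.getBytes content StandardCharsets/UTF_8))
      (.flush out)
      ;; Ask the OS to push the temp file's bytes to the device (fsync)
      ;; before the rename is issued.
      (.sync (.getFD out)))
    ;; ATOMIC_MOVE throws if the filesystem cannot do the move atomically.
    ;; Whether an existing target is replaced is implementation specific.
    (Files/move (.toPath tmp) (.toPath target)
                (into-array CopyOption [StandardCopyOption/ATOMIC_MOVE]))
    target))
```

Calling `(write-file-durably! "myfile.txt" mystring)` would stand in for the plain `spit`. Note that this sketch does not sync the parent directory, so the rename itself may still be lost in a crash, although the old contents should then remain intact.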

Rafał Dowgird
  • Thanks. This looks like exactly what I needed to know. Let me wait for more feedback before marking your answer as "correct" – drcode Mar 04 '13 at 19:58

The example you give is to my understanding completely idiomatic and correct. I would just do a delete on tempfile first in case the previous run failed and add some error detection.
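
Something along these lines is what I have in mind; treat it as a sketch rather than a finished implementation. `replace-file!` and the `.tmp` suffix are names made up for the example. It only adds the cleanup and the error check, and does not address the fsync concern raised in the other answer.

```clojure
(require '[clojure.java.io :as io])

(defn replace-file!
  "Hypothetical helper: clears any stale temp file, writes the new
  content to it, then renames it over the target."
  [path content]
  (let [^java.io.File target (io/file path)
        ^java.io.File tmp    (io/file (str path ".tmp"))]
    ;; Remove a temp file left behind by a previous failed run. The
    ;; `true` argument tells delete-file not to throw if it is absent.
    (io/delete-file tmp true)
    (spit tmp content)
    ;; File.renameTo signals failure by returning false, not by throwing,
    ;; so the result has to be checked explicitly.
    (when-not (.renameTo tmp target)
      (throw (ex-info "Could not rename temp file over target"
                      {:tmp (str tmp) :target (str target)})))
    target))
```

Be aware that `renameTo` is platform dependent; on some systems it may fail when the target already exists, which is exactly the case the check above is meant to surface.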

Arthur Ulfeldt
  • Thanks Arthur, this is really helpful. However, Rafal's similar, but more complex response edges yours out a tiny amount on technical detail. – drcode Mar 04 '13 at 20:03

Based on the feedback from your comment, I would recommend that you avoid trying to roll your own file-backed database, based on a couple of observations:

  • Persistent storage of data structures in the filesystem that is consistent in the case of crashes is a tough problem to solve. Lots of really smart people have spent lots of time thinking about this problem.
  • Small databases tend to grow into big databases and collect extra features over time. If you roll your own, you'll find yourself reinventing the wheel over the course of the project.

If you're truly interested in maintaining consistency of your application's data in the event of a crash, then I'd recommend you look at embedding one of the many freely available databases. You could start by looking at Berkeley DB, HyperSQL, or, for one with a more Clojure flavor, Datomic.

Alex
  • Hi Alex: I think you're absolutely correct for almost all circumstances. However, I think I have a rare use case where an external database is impractical. (Part of the reason I don't know the answer to this question is precisely because I usually use an external DB as you recommend...) Thanks for your answer! – drcode Mar 04 '13 at 20:16
  • That's why I recommended an embedded database. For example, Berkeley DB and HSQLDB can both provide in-process access to file-backed databases with no communication to external processes. – Alex Mar 04 '13 at 20:30
  • Interesting... you're right, that's probably what I need to be using. – drcode Mar 04 '13 at 21:05