
The scenario

I have a file server that acts as the master storage for the files to sync, and I have several clients that each hold a local copy of the master storage. Each client may alter files from the master storage, add new ones or delete existing ones. I would like all of them to stay in sync as well as possible by regularly performing a sync operation, yet the only tool I have available everywhere for that is rsync, and I can only run script code on the clients, not on the server.

The problem

rsync doesn't perform a bi-directional sync, so I have to sync from server to client as well as from client to server. This works okay for files that just changed by running two rsync operations but it fails when files have been added or deleted. If I don't use rsync with a delete option, clients cannot ever delete files as the sync from the server to the client restores them. If I use a delete option, then either the sync from server to client runs first and deletes all new files the client has added or the sync from client to server runs first and deletes all new files other clients have added to the server.

The question

Apparently rsync alone cannot handle that situation, since it is only supposed to bring one location in sync with another location. I surely need to write some code, but I can only rely on POSIX shell scripting, which seems to make achieving my goals impossible. So can it even be done with rsync?

Mecki
    Good stuff, but no '[bash]` tag? ;-) – shellter Oct 02 '20 at 13:43
  • @shellter why should it have a #bash ? – cregox May 13 '22 at 20:02
    @cregox. 65K+ followers for bash, vs 33K for shell. Just more eyes on the problem. – shellter May 13 '22 at 22:22
  • @shellter sounds reasonable but then why don't you just add it yourself? – cregox May 14 '22 at 02:55
    @cregox : I don't like modifying other's posts, but I'm happy to leave a comment ;-)! – shellter May 17 '22 at 22:55
  • @shellter oof, got your point. it reminds me of why i need to leave stack overflow... can't really put a name on it, but at some point the gamification loses the openness i need to see, such as in wiki and especially agpl, if that makes any sense! thanks for all your feedback and if you want to continue on this off topic, please do it on the fediverse: https://talk.ahoxus.org/notice/AJZ5JwzcAs9DLdtLJw – cregox May 18 '22 at 04:09

1 Answer


What is required for this scenario are three sync operations and awareness of which files the local client has added/deleted since the last sync. This awareness is essential and establishes a state, which rsync doesn't have, as rsync is stateless; when it runs it knows nothing about previous or future operations. And yes, it can be done with some simple POSIX scripting.

We will assume three variables are set:

  1. metaDir is a directory where the client can persistently store files related to the sync operations; the content itself is not synced.

  2. localDir is the local copy of the files to be synced.

  3. remoteStorage is any valid rsync source/target (can be a mounted directory or an rsync protocol endpoint, with or without SSH tunneling).

After every successful sync, we create a file in the meta dir that lists all files in the local dir; we need this to track files getting added or deleted between two syncs. If no such file exists, we have never run a successful sync. In that case we just sync all files from remote storage, build such a file, and we are done:

filesAfterLastSync="$metaDir/files_after_last_sync.txt"

if [ ! -f "$filesAfterLastSync" ]; then
    rsync -a "$remoteStorage/" "$localDir"
    ( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"
    exit 0
fi

Why ( cd "$localDir" && find . ) | sed "s/^\.//"? Files need to be rooted at $localDir for the rsync calls later on. If a file $localDir/test.txt exists, the generated output file line must be /test.txt and nothing else. Without the cd, using an absolute path for the find command, it would contain /..abspath../test.txt, and without the sed it would contain ./test.txt. Why the explicit sort call? See further below.
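If that transformation is hard to picture, here is a tiny self-contained demonstration; the directory and file names are made up for illustration:

```shell
# Build a sample tree and produce the file list exactly as the script does.
sampleDir=$( mktemp -d )
mkdir -p "$sampleDir/docs"
touch "$sampleDir/test.txt" "$sampleDir/docs/readme.md"

list=$( ( cd "$sampleDir" && find . ) | sed "s/^\.//" | sort )
printf '%s\n' "$list"
# Output: an empty line, then /docs, /docs/readme.md, /test.txt
# (find also prints "." itself, which the sed reduces to that empty line)
rm -rf "$sampleDir"
```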

If that isn't our initial sync, we should create a temporary directory that auto-deletes itself when the script terminates, no matter which way:

tmpDir=$( mktemp -d )
trap 'rm -rf "$tmpDir"' EXIT
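To convince yourself that the cleanup really happens no matter how the script ends, here is a small sketch: the trap is registered inside a subshell, and once the subshell terminates, the directory is gone:

```shell
# Create a temp dir and register the EXIT trap inside a subshell,
# then let the subshell terminate; the trap removes the directory.
probe=$(
    tmpDir=$( mktemp -d )
    trap 'rm -rf "$tmpDir"' EXIT
    printf '%s' "$tmpDir"      # report the path to the outer shell
)
[ -d "$probe" ] || echo "temp dir was cleaned up"
```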

Then we create a file list of all files currently in local dir:

filesForThisSync="$tmpDir/files_for_this_sync.txt"
( cd "$localDir" && find . ) | sed "s/^\.//" | sort  > "$filesForThisSync"

Now why is there that sort call? The reason is that I need the file list to be sorted below. Okay, but then why not tell find to sort the list? That's because find is not guaranteed to sort the same way as sort does (that is explicitly documented on the man page) and I need exactly the order that sort produces.

Now we need to create two special file lists, one containing all files that were added since the last sync and one containing all files that were deleted since the last sync. Doing so is a bit tricky with just POSIX, but various possibilities exist. Here's one of them:

newFiles="$tmpDir/files_added_since_last_sync.txt"
join -t "" -v 2 "$filesAfterLastSync" "$filesForThisSync" > "$newFiles"

deletedFiles="$tmpDir/files_removed_since_last_sync.txt"
join -t "" -v 1 "$filesAfterLastSync" "$filesForThisSync" > "$deletedFiles"

By setting the delimiter to an empty string, join compares whole lines. Usually the output would contain all lines that exist in both files, but we instruct join to only output lines of one file that cannot be matched with lines of the other file. Lines that only exist in the second file must be from files that have been added, and lines that only exist in the first file must be from files that have been deleted. And that's why I use sort above, as join can only work correctly if the lines were sorted by sort.
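As a quick illustration of that join behavior, with made-up file lists:

```shell
# Two sorted lists: the state after the last sync vs. the state now.
old=$( mktemp ); new=$( mktemp )
printf '%s\n' /a.txt /b.txt /c.txt > "$old"   # files after last sync
printf '%s\n' /a.txt /c.txt /d.txt > "$new"   # files right now

# -t "" compares whole lines; -v 2 keeps lines found only in file 2,
# -v 1 keeps lines found only in file 1.
join -t "" -v 2 "$old" "$new"   # -> /d.txt  (added since last sync)
join -t "" -v 1 "$old" "$new"   # -> /b.txt  (deleted since last sync)
rm -f "$old" "$new"
```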

Finally we perform three sync operations. First we sync all new files to the remote storage to ensure these are not getting lost when we start working with delete operations:

rsync -aum --files-from="$newFiles" "$localDir/" "$remoteStorage"

What is -aum? -a means archive, which means sync recursively, keep symbolic links, keep file permissions, keep all timestamps, try to keep ownership and group, and some other things (it's a shortcut for -rlptgoD). -u means update, which means if a file already exists at the destination, only sync if the source file has a newer last modification date. -m means prune empty directories (you can leave it out if that isn't desired).

Next we sync from remote storage to local with deletion, to get all changes and file deletions performed by other clients, yet we exclude the files that have been deleted locally, as otherwise those would get restored, which we don't want:

rsync -aum --delete --exclude-from="$deletedFiles" "$remoteStorage/" "$localDir"

And finally we sync from local to remote storage with deletion, to update files that were changed locally and delete files that were deleted locally.

rsync -aum --delete "$localDir/" "$remoteStorage" 

Some people might think that this is too complicated and that it can be done with just two syncs: first sync remote to local with deletion, excluding all files that were either added or deleted locally (that way we only need to build a single special file, which is even easier); then sync local to remote with deletion, excluding nothing. Yet this approach is faulty. It requires a third sync to be correct.

Consider this case: Client A created FileX but hasn't synced yet. Client B also creates FileX a bit later and syncs at once. When client A now performs the two syncs above, FileX on remote storage is newer and should replace FileX on client A, but that won't happen. The first sync explicitly excludes FileX; it was added to client A and thus must be excluded to not be deleted by the first sync (client A cannot know that FileX was also added and uploaded to remote by client B). And the second one would only upload to remote and exclude FileX, as the remote one is newer. After the sync, client A has an outdated FileX, despite the fact that an updated one existed on remote.

To fix that, a third sync from remote to local without any exclusion is required. So you would also end up with three sync operations, and compared to those, I think the three I presented above are always at least as fast and sometimes even faster, so I would prefer them; however, the choice is yours. Also, if you don't need to support that edge case, you can skip the last sync operation; the problem will then resolve itself automatically on the next sync.

Before the script quits, don't forget to update our file list for the next sync:

( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"

Finally, --delete implies --delete-before or --delete-during, depending on your version of rsync. You may prefer another or explicit specified delete operation.

Mecki
    Thanks for the very useful script! Is it maybe missing a closing rewrite of the meta data? `( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"` – JLP Labs May 05 '21 at 18:26
    @JLPLabs You are correct. Well, I wrote about it at the very beginning of the script but didn't actually mention it again at the end. I fixed that. Thanks for the hint. – Mecki May 05 '21 at 21:03
  • is there a way to adapt this to implement a true 2-way sync using only `rsync` (which has been impossible thus far from what i've seen except when adapted into another program like `unison`)? which i guess would take into account adds/deletes on BOTH local and remote instead of just local. then only update/add/delete files based on when it happened on either local or remote. eg: file X deleted on local, then later file X modified on remote; transfer would put modified X back on local. – Leo Jun 15 '21 at 20:22
    @Leo My suggestion above does take deletes of both sides into account. Files that used to be there on the server the last time you synced and are gone now will be deleted locally. And files that used to be locally the last time you synced and are missing now will be deleted from the server. I use that script to sync two computers over a cloud storage and deleting a file on either one makes it vanish from the other one the next time I sync. – Mecki Jun 16 '21 at 00:35
  • @Mecki Sorry, must have read/followed it wrong! Are you aware of any issues with this method using `--update`, `--delete` and `--backup` for archival purposes, or `--link-dest` and `--hard-links` for tracking moved/renamed files? (those functionalities would also be required for my use case) – Leo Jun 16 '21 at 17:03
  • @Leo haven't tried it in combination with hard links. Moved/renamed files are treated like new files being added and old files being deleted by my script. I only know of one issue my solution has in practice: I cannot deal with conflicts. Create a new file on both clients with same name but with entirely different content, only the later created one will be found on both clients in the end, the other one is "silently" lost. Or alter the beginning of a text file on one client and the end of it on another one. After syncing only the change performed later in time will stay, the other one is lost – Mecki Jun 16 '21 at 22:46
  • @mecki did you know even git can't properly handle conflicts? anyway, syncthing has a very simple conflict solution for files: use a special trashcan folder. i even prefer the versioning trashcan, if it used more folders instead of renaming the versions. do you think we could make something similar still keeping it simple? – cregox May 13 '22 at 20:13