I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.

This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.

Easiest way to do this?

Thanks!

aidan

4 Answers


Do you really need to compress the file?

wget provides `-N, --timestamping`, which turns on time-stamping. Say your file is located at www.example.com/file.txt.

The first time you do:

$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]

The next time it'll be like this:

$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.

That is, unless the file on the server has been updated.

That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd go with comparing the hash of the new file/archive against the old one. What matters in that case: how big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive to compare the hashes? Is it better to store the old hash in a text file? Does any of these have an advantage over simply overwriting the old file?

Only you can answer that; run some tests.


So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):

newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
oldfilesum="$(xzcat file.txt.xz 2>/dev/null | sha256sum)" # empty hash on the first run
if [[ $newfilesum != $oldfilesum ]]; then
    xz -f file.txt # changed: overwrite with the new compressed data
else
    rm file.txt    # unchanged: discard the download
fi

and that's done.
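If you'd rather store the old hash in a text file (one of the options above), it could look something like this; this is just a sketch to play with, and `fetch()`, the `.sha256` sidecar file, and the URL are all placeholders:

```shell
#!/bin/sh
# Variant that keeps only the previous hash in a small text file.
# fetch() stands in for the real download command, e.g.:
#   wget -q www.example.com/file.txt -O-
fetch() {
    wget -q www.example.com/file.txt -O-
}

update_if_changed() {
    file="$1"
    hashfile="$file.sha256"

    # download, save a copy with tee, and hash it in one go
    newsum="$(fetch | tee "$file" | sha256sum | cut -d' ' -f1)"
    oldsum="$(cat "$hashfile" 2>/dev/null)"   # empty on the first run

    if [ "$newsum" != "$oldsum" ]; then
        printf '%s\n' "$newsum" > "$hashfile" # remember this version's hash
        xz -f "$file"                         # keep the new compressed copy
        echo updated
    else
        rm "$file"                            # unchanged; discard the download
        echo unchanged
    fi
}
```

In the cron job you'd call `update_if_changed file.txt` (or pass a timestamped name as the argument, since the script shouldn't hard-code filenames).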

c00kiemon5ter
  • I didn't actually know that. Thanks for the info - very useful. Unfortunately the server isn't providing any useful last-modified or etag headers. – aidan Jun 12 '11 at 14:59
  • When running `wget` with `-N`, the decision as to whether or not to download a newer copy of a file depends on the _local_ and _remote_ **timestamp** and **size** of the file. So if you know that the file grows every time it's updated and can't have the same size, or if you believe the possibility of the file being updated and having the same size is too small, then you can still use that. – c00kiemon5ter Jun 12 '11 at 15:15
  • @aidan I edited my answer to provide a hash type solution in a bit, check if that suits you ;) – c00kiemon5ter Jun 12 '11 at 16:04
  • Thanks for the update. As the script is run as a cronjob, I can't hard code the filenames in like that (I need to keep all the old versions). But I could write out the filenames to a text file and read them back later. In the end I added a dirty perl script to the script! (detailed in one of my comments below). Nice use of tee! – aidan Jun 14 '11 at 15:40
  • well, you'd probably modify this to pass the filename as an argument to the script, or use it as a var, it's just a draft to play with. Yeah, `tee` is nice and not so widely used. I wanted to store the file and read the contents in one go so `tee` was good for the job :) haha, Perl is dirty for sure :D – c00kiemon5ter Jun 14 '11 at 15:50

Calculate a hash of the file's content and check it against the new one, using for instance md5sum. You only have to save the last MD5 sum to tell whether the file changed.
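A minimal sketch of that idea, assuming the cron job has just downloaded the new copy to the given filename; the `.md5` sidecar file is an invented convention, and `md5sum --status -c` (GNU coreutils) does the comparison:

```shell
#!/bin/sh
# Returns 0 if the file is new or changed, 1 if it's a duplicate
# of the last run (in which case the download is discarded).
dedupe() {
    file="$1"
    sumfile="$file.md5"
    if md5sum --status -c "$sumfile" 2>/dev/null; then
        rm "$file"                  # same sum as last run; discard
        return 1
    fi
    md5sum "$file" > "$sumfile"     # remember this version's sum
    return 0                        # changed (or first run)
}
```

The cron job could then do `dedupe file.txt && gzip file.txt` so compression only happens for genuinely new content.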

Also, take into account that the web is evolving to provide more information about pages, that is, metadata. A well-designed web site should include a file version and/or modification date (or a valid Expires header) as part of the response headers. This, among other things, is what makes up the scalability of Web 2.0.

Adam Liss
Diego Sevilla

How about downloading the file, and checking it against a "last saved" file?

For example, the first time it downloads myfile, saves it as myfile-[date], and compresses it. It also adds a symbolic link, lastfile, pointing to myfile-[date]. The next time the script runs, it can check whether the contents of whatever lastfile points to are the same as the newly downloaded file.

Don't know if this would work well, but it's what I could think of.
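It could work along these lines; the URL and filenames are made up, and since the saved copies end up gzipped, the comparison goes through zcat:

```shell
#!/bin/sh
# fetch() stands in for the real download, e.g.:
#   wget -q -O "$1" www.example.com/myfile
fetch() {
    wget -q -O "$1" www.example.com/myfile
}

save_if_new() {
    new="myfile-$(date +%Y%m%d)"
    fetch "$new"
    if [ -e lastfile ] && zcat lastfile | cmp -s - "$new"; then
        rm "$new"                    # same as the last saved copy; discard
        echo duplicate
    else
        gzip -f "$new"               # keep the new version compressed...
        ln -sfn "$new.gz" lastfile   # ...and repoint lastfile at it
        echo saved
    fi
}
```

`ln -sfn` replaces the old symlink atomically enough for a once-a-day cron job, and `cmp -s` keeps the comparison byte-exact without hashing.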

Ryan Leonard
  • I like this idea. I was hoping there was a way to de-dupe without having to store a pointer to the last file. But this'll work. – aidan Jun 12 '11 at 15:01
  • Screw it. I'll use perl. `perl -e '%x=(); for (<*>){$md5 = \`md5sum $_\`; next unless $md5 =~ /([0-9a-f]{32})/; \`rm $_\` if $x{$1}++}'` – aidan Jun 12 '11 at 15:06

You can compare the new file with the last one using the sum command. This takes the checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's also md5sum (md5 on BSD systems), which takes an MD5 fingerprint, but the sum command is available on virtually all systems.
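A tiny sketch of that comparison; `sum` prints "checksum blocks", so only the first field is compared (filenames are illustrative):

```shell
#!/bin/sh
# True if the two files have the same `sum` checksum.
same_checksum() {
    [ "$(sum "$1" | awk '{print $1}')" = "$(sum "$2" | awk '{print $1}')" ]
}
```

The cron job could then run something like `same_checksum newfile lastfile && rm newfile`. Note that sum's checksum is much weaker than MD5, so collisions are more plausible than with md5sum.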

David W.