
I have a script which I'm using to copy files from one location to another; the files beneath the directory structure are all .txt files.

This script just evaluates the file size on the source and only copies a file if its size is not zero bytes. However, I need to run this script from cron at certain intervals to copy any incremented data.

So, I need to know how to copy only the content that was updated in the source file, and update the destination with just the new content rather than overwriting the file if it is already present at the destination.

Code:

#!/bin/python3
import os
import glob
import shutil
import datetime

def Copy_Logs():
    Info_month = datetime.datetime.now().strftime("%B")
    # The result of the below glob _is_ a full path
    for filename in glob.glob("/data1/logs/{0}/*/*.txt".format(Info_month)):
        if os.path.getsize(filename) > 0:
            if not os.path.exists("/data2/logs/" + os.path.basename(filename)):
                shutil.copy(filename, "/data2/logs/")

if __name__ == '__main__':
    Copy_Logs()

I'm looking to see whether there is a way to use shutil the way rsync works, or whether there is an alternative to the code I have.

In a nutshell, I need to copy each file only once if it hasn't already been copied, and then copy only the delta if the source gets updated.

Note: The Info_month = datetime.datetime.now().strftime("%B") line is mandatory to keep, as it determines the current directory by month name.

Edit:

Just having another raw idea: perhaps the filecmp module could be used with shutil.copyfile to compare files and directories, but I'm not getting how to fit that into the code.

import os
import glob
import filecmp
import shutil
import datetime

def Copy_Logs():
    Info_month = datetime.datetime.now().strftime("%B")
    for filename in glob.glob("/data1/logs/{0}/*/*.txt".format(Info_month)):
        if os.path.getsize(filename) > 0:
            dest = "/data2/logs/" + os.path.basename(filename)
            # Copy when the file is missing or its contents differ from the source
            if not os.path.exists(dest) or not filecmp.cmp(filename, dest, shallow=False):
                shutil.copyfile(filename, dest)

if __name__ == '__main__':
    Copy_Logs()
krock1516

7 Answers


You could use Google's Diff Match Patch (you can install it with pip install diff-match-patch) to create a diff and apply it as a patch:

import diff_match_patch as dmp_module

#...
if not os.path.exists("/data2/logs/" + os.path.basename(filename)):
    shutil.copy(filename, "/data2/logs/")
else:
    with open(filename) as src, \
         open("/data2/logs/" + os.path.basename(filename), 'r+') as dst:
        dmp = dmp_module.diff_match_patch()

        src_text = src.read()
        dst_text = dst.read()

        diff = dmp.diff_main(dst_text, src_text)

        if len(diff) == 1 and diff[0][0] == 0:
            # No changes
            continue

        #make patch
        patch = dmp.patch_make(dst_text, diff)
        #apply it
        result = dmp.patch_apply(patch, dst_text)

        #write
        dst.seek(0)
        dst.write(result[0])
        dst.truncate()
y.luis.rojo
  • @krock1516, pip install did not work for you? Which error do you get? – y.luis.rojo Jan 20 '19 at 16:26
  • 1
    I never removed `Info_month`... I just wrote the interesting fragment of the code for you to integrate it into yours. – y.luis.rojo Jan 20 '19 at 16:29
  • BTW, you can download pip library (https://files.pythonhosted.org/packages/f0/2a/5ba07def0e9107d935aba62cf632afbd0f7c723a98af47ccbcab753d2452/diff-match-patch-20181111.tar.gz) and install it from the local file: https://packaging.python.org/tutorials/installing-packages/#installing-from-local-archives. – y.luis.rojo Jan 20 '19 at 16:32
  • You do not need access to outside. If you have ssh access to it (do you?), you can copy the downloaded file with `scp`: http://www.hypexr.org/linux_scp_help.php – y.luis.rojo Jan 20 '19 at 16:43
  • I know scp and other file-transfer mechanisms, but my question in the first place is: how will you get the package before you copy it, when you do not have it in your local/private repository? – krock1516 Jan 20 '19 at 16:50
  • As I proposed before, from a (personal?) machine with access to the Internet, and `ssh` access to your server, you download the file (https://pypi.org/project/diff-match-patch/#files) and then copy it using `scp`. – y.luis.rojo Jan 20 '19 at 17:06
  • This is my home system; I can't copy to my office system :( – krock1516 Jan 20 '19 at 17:07
  • How do you access your office system? How do you deploy your code in it? – y.luis.rojo Jan 20 '19 at 17:09
  • This is somewhat beside the question, but I have my own home lab setup for my personal R&D, which I use for testing, and I take the liberty of testing by writing my own code. There is no way to move the code over other than writing it again there; this is just a test simulation piece. – krock1516 Jan 20 '19 at 17:17

As mentioned, rsync is a better way to do this kind of job where you need an incremental file list, or the delta of the data. So I would rather do it with rsync and the subprocess module.

However, you can also assign a variable Curr_date_month holding the current date, month, and year, as you require, so that only files from the current month and day folder are copied. You can also define source and destination variables just for the ease of writing them into the code.

Secondly, though you already check the file size with getsize, I would add the rsync option --min-size=1 to make sure zero-byte files are not copied.

Here is the final code:

#!/bin/python3
import os
import glob
import datetime
import subprocess

def Copy_Logs():
    # Variable Declaration to get the month and Curr_date_month
    Info_month = datetime.datetime.now().strftime("%B")
    Curr_date_month = datetime.datetime.now().strftime("%b_%d_%y") 
    Sourcedir = "/data1/logs"
    Destdir = "/data2/logs/"
    ###### End of your variable section #######################
    # The result of the below glob _is_ a full path; no exists-check is needed,
    # since rsync itself skips files already up to date at the destination
    for filename in glob.glob("{2}/{0}/{1}/*.txt".format(Info_month, Curr_date_month, Sourcedir)):
        if os.path.getsize(filename) > 0:
            subprocess.call(['rsync', '-avz', '--min-size=1', filename, Destdir])

if __name__ == '__main__':
    Copy_Logs()
Karn Kumar

One way is to save a single line to a file to keep track of the latest time you copied the files (with the help of os.path.getctime), and maintain that line each time you copy.

Note: The following snippet can be optimized.

import datetime
import glob
import os
import shutil

Info_month = datetime.datetime.now().strftime("%B")
list_of_files = sorted(glob.iglob("/data1/logs/{0}/*/*.txt".format(Info_month)),
                       key=os.path.getctime, reverse=True)
if not os.path.exists("track_modifications.txt"):
    latest_file_modified_time = os.path.getctime(list_of_files[0])
    for filename in list_of_files:
        shutil.copy(filename, "/data2/logs/")
    with open('track_modifications.txt', 'w') as the_file:
        the_file.write(str(latest_file_modified_time))
else:
    with open('track_modifications.txt', 'r') as the_file:
        latest_file_modified_time = the_file.readline()
    should_copy_files = [filename for filename in list_of_files if
                         os.path.getctime(filename) > float(latest_file_modified_time)]
    for filename in should_copy_files:
        shutil.copy(filename, "/data2/logs/")
    # Record the newest timestamp so the next run skips these files
    if should_copy_files:
        with open('track_modifications.txt', 'w') as the_file:
            the_file.write(str(os.path.getctime(should_copy_files[0])))

The approach is to create a file that contains the timestamp of the latest file that was modified by the system.

Retrieving all the files and sorting them by the modification time

list_of_files = sorted(glob.iglob('directory/*.txt'), key=os.path.getctime, reverse=True)

Initially, in if not os.path.exists("track_modifications.txt"): I check whether this file exists (i.e., whether it is the first time copying); if it doesn't, I save the largest file timestamp in

latest_file_modified_time = os.path.getctime(list_of_files[0])

And I just copy all files given and write this timestamp to the track_modifications file.

Otherwise, the file exists (i.e., files were copied before), so I just read that timestamp, compare it against the files in list_of_files, and retrieve all files with a larger timestamp (i.e., created after the last file I copied). That is done in

should_copy_files = [filename for filename in list_of_files if os.path.getctime(filename) > float(latest_file_modified_time)]

Actually, tracking the timestamp of the latest modified file also gives you the advantage of re-copying already-copied files when they change :)

ndrwnaguib
  • This however copies the whole file which will be a real issue in case the files are large instead of copying the diff only. Also you don't track any changes here therefore a simple `touch file.txt` would make your script copy the files even if nothing changes in them (let's say an app accesses the file and tracks it with a timestamp). – Peter Badida Jan 20 '19 at 15:33
  • True it copies the whole file if it wasn't copied before. No, a simple `touch file.txt` won't make it copy all files but the `file.txt` only. – ndrwnaguib Jan 20 '19 at 15:35
  • @krock1516 I will change the paths to be as yours. I was just illustrating an approach:) – ndrwnaguib Jan 20 '19 at 15:36
  • 1
    Thanks appreciate all your inputs and answer.. just waiting for another approaches across this post. – krock1516 Jan 20 '19 at 15:43

There are some very interesting ideas in this thread, but I will try to propose some new ideas.

Idea no. 1: Better way for tracking updates

Per your question, it's clear that you are using a cron job to check for updated files.

If you are trying to monitor a relatively small amount of files/directories, I would propose a different approach that will simplify your life.

You can use the Linux inotify mechanism, that allows you to monitor specific files/directories and get notified whenever a file is written to.

Pro: You know of every single write immediately, without needing to check for changes. You can of course write a handler that doesn't update the destination for every write, but one in X minutes.

Here is an example that uses the inotify python package (taken from the package's page):

import inotify.adapters

def _main():
    i = inotify.adapters.Inotify()

    i.add_watch('/tmp')

    with open('/tmp/test_file', 'w'):
        pass

    for event in i.event_gen(yield_nones=False):
        (_, type_names, path, filename) = event

        print("PATH=[{}] FILENAME=[{}] EVENT_TYPES={}".format(
              path, filename, type_names))

if __name__ == '__main__':
    _main()

Idea no. 2: Copying only the changes

If you decide to use the inotify mechanism, it will be trivial to keep track of your state.

Then, there are two possibilities:

1. New contents are ALWAYS appended

If this is the case, you can simply copy anything from your last offset to the end of the file.
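As a rough sketch of that append-only case (the file names are illustrative, and the destination's current size is used as the "last offset"):

```python
import os

def copy_tail(src_path, dst_path):
    """Append to dst_path whatever was written to src_path past dst_path's size.

    Assumes src_path only ever grows (append-only logs).
    """
    # Bytes already present at the destination; 0 if nothing was copied yet
    copied = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(copied)   # skip the part that is already at the destination
        dst.write(src.read())
```

Each run then transfers only the bytes added since the previous run, instead of the whole file.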

2. New contents are written at random locations

In this case, I would recommend a method proposed by other answers as well: Using diff patches. This is by far the most elegant solution in my opinion.

Some options here are:
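One stdlib-only option, as a rough sketch (the sample strings are illustrative), is Python's difflib, which can both produce a diff and rebuild the newer version from an ndiff delta:

```python
import difflib

old = "line one\nline two\n"
new = "line one\nline two\nline three\n"

# A unified diff shows only what changed between the two versions
diff = list(difflib.unified_diff(old.splitlines(keepends=True),
                                 new.splitlines(keepends=True),
                                 fromfile="old.txt", tofile="new.txt"))
print("".join(diff))

# ndiff deltas are reversible: restore(delta, 2) rebuilds the second sequence
delta = list(difflib.ndiff(old.splitlines(keepends=True),
                           new.splitlines(keepends=True)))
restored = "".join(difflib.restore(delta, 2))
```

Third-party libraries like diff-match-patch (used in another answer here) offer a richer patch format, but difflib needs no installation.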

Daniel Trugman

One of the benefits of rsync is that it only copies the differences between files. As files become huge, this drastically reduces I/O.

There is a plethora of rsync-like implementations and wrappers around the original program on PyPI. This blog post describes how to implement rsync in Python in a very good way, and can be used as is.

As for checking whether the sync is needed at all, you can use filecmp.cmp(). In its shallow variant it only checks the os.stat() signature.
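For instance, a minimal sketch of the shallow vs. full comparison (the scratch files are illustrative):

```python
import filecmp
import os
import shutil
import tempfile

# Create two files with identical contents in a scratch directory
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "src.txt")
dst = os.path.join(tmp, "dst.txt")
with open(src, "w") as f:
    f.write("hello\n")
shutil.copy2(src, dst)  # copy2 preserves mtime, so the stat signatures match

print(filecmp.cmp(src, dst))                 # shallow: compares os.stat() only
print(filecmp.cmp(src, dst, shallow=False))  # full: compares actual contents
```

Note that the shallow check can report files as equal merely because size and mtime match, which is exactly why it is cheap.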

igrinis
  • @igrinis, thanks for your answer, I appreciate it. Yes, I know rsync is a versatile way to handle this type of request, and that is what I'm looking for in my code. However, with the Python implementation I can simply use it as `rsync -av --min-size=1 source_path dest_path` and all is done. – krock1516 Jan 22 '19 at 06:58
  • I think you have missed the blog post part in the answer ;) – igrinis Jan 22 '19 at 07:02
  • NP. If it's something like log files you can assume that all the differences are at the end of the file, and copy only the tail portion. Then you just check for the size, and if differ, copy from the original file starting from DESTINATION file size, and append to the copy. – igrinis Jan 22 '19 at 07:39

You need to save the changes somewhere or listen for the event that fires when the file contents change. For the latter you can use watchdog.

If you decide you really prefer cron instead of incrementally checking for the changes (watchdog) you'll need to store the changes in some database. Some basic example would be:

ID | path        | state before cron
1  | /myfile.txt | hello
...| ...         | ...

then to check the diff you'd dump the state before cron to a file, run a simple diff old.txt new.txt and if there is some output (i.e. there is a change), you would copy either the whole file or just the output of the diff alone which you would then apply as a patch to the file you want to overwrite.

In case there is no diff output, there is no change and therefore nothing to update in the file.

Edit: Actually :D you might not even need a database if the files are on the same machine... That way you can just diff+patch directly between the old and new files.

Example:

$ echo 'hello' > old.txt && echo 'hello' > new.txt
$ diff old.txt new.txt                             # empty
$ echo 'how are you' >> new.txt                    # your file changed
$ diff old.txt new.txt > my.patch && cat my.patch  # diff is not empty now
1a2
> how are you

$ patch old.txt < my.patch  # apply the changes to the old file

and in Python with the same old.txt and new.txt base:

from subprocess import Popen, PIPE
diff = Popen(['diff', 'old.txt', 'new.txt'], stdout=PIPE).communicate()[0]
Popen(['patch', 'old.txt'], stdin=PIPE).communicate(input=diff)
Peter Badida
  • KeyWeeUsr, thanks for the answer with a different approach, but how could I use it with my current code? I'm looking for an alternative solution, even if not mine. I'm on the same machine. – krock1516 Jan 20 '19 at 14:56
  • @krock1516 I added an example. You would just need to repeat it for each file in your location. – Peter Badida Jan 20 '19 at 15:20

You'll have to integrate a database, and you can keep a record of files according to their size, name, and author.

In case of any updates there'll be a change in the size of the file, and you can update or append accordingly.
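A minimal sketch of that record-keeping idea (using a JSON file rather than a full database; the file and function names are illustrative):

```python
import json
import os

STATE_FILE = "sizes.json"  # hypothetical record of last-seen file sizes

def changed_files(paths):
    """Return the files whose size differs from the recorded one, then update the record."""
    sizes = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            sizes = json.load(f)
    # A file counts as changed if it is new or its size moved since last run
    changed = [p for p in paths if sizes.get(p) != os.path.getsize(p)]
    sizes.update({p: os.path.getsize(p) for p in paths})
    with open(STATE_FILE, "w") as f:
        json.dump(sizes, f)
    return changed
```

The copy step would then only process what changed_files() returns.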

  • Prakruti, I don't need to integrate a database and keep records; there might be other handy ways, like rsync. – krock1516 Jan 15 '19 at 11:26