
I have a folder into which new files are constantly being added. I have a Python script that uses os.listdir() to find these files and then performs analysis on them automatically. However, the files are quite large, so they show up in os.listdir() before they've actually been completely written/copied. Is there some way to distinguish which files are not still in the process of being moved? Comparing sizes with os.path.getsize() doesn't seem to work.

Raspbian Buster on Pi4 with Python 3.7.3. I am a noob to programming and linux.

Thanks!

oguz ismail
rfii
  • Does this answer your question? [Is a move operation in Unix atomic?](https://stackoverflow.com/questions/18706419/is-a-move-operation-in-unix-atomic), also on the Unix StackExchange, [Is mv atomic operation between two file systems?](https://unix.stackexchange.com/questions/452620/is-mv-atomic-operation-between-two-file-systems) – metatoaster Aug 04 '20 at 03:59
  • A workaround would be to have the process that is creating the file do so in a temporary location on the same filesystem as the intended location, and only when it's done call `rename` to atomically move it to the final location where the Python program expects it (see the sketch after this thread). – metatoaster Aug 04 '20 at 04:04
  • Thanks for the links! I couldn't find documentation on linux rename moving files but I see that rename(2) moves files. Or did you mean python's os.rename? https://alexwlchan.net/2019/03/atomic-cross-filesystem-moves-in-python/ – rfii Aug 13 '20 at 16:47
  • The standard utility `mv` makes use of `rename(2)` to "move" files (as documented on the man page you found); likewise Python's [`os.rename`](https://docs.python.org/3/library/os.html#os.rename) call should function similarly as documented, with the additional Python datatypes being supported, which the interpreter will unwrap into the native binary type before making the appropriate system call(s). – metatoaster Aug 14 '20 at 03:15
  • ok thanks so linux `mv` = linux `rename` = python `os.rename` but does not equal python `shutil.move` – rfii Aug 14 '20 at 04:02
  • Not quite so simple as that: [`shutil.move`](https://docs.python.org/3/library/shutil.html#shutil.move) does use `os.rename` as documented [(refer to the implementation)](https://github.com/python/cpython/blob/v3.8.4/Lib/shutil.py#L749..L804), with additional handling for different failure cases. – metatoaster Aug 14 '20 at 04:43
  • ok now I am confused bc if os.rename is atomic and shutil.move uses rename, then why does it seem like shutil.move is not atomic? – rfii Aug 14 '20 at 04:55
  • `shutil.move` does not _only_ use rename (the linked code clearly showed that), it will use the other function calls as documented when the `os.rename` returns a failure (and a retry using a different, non-atomic method will be done). Even if `os.rename` is used, if the source and destination are on different filesystems, the `rename(2)` call cannot be used and thus a read and write to new location will be done. The `rename` system call is only applicable if source and destination are on the same filesystem, otherwise it will fail (and `mv` will do non-atomic read/write to "move" the file) – metatoaster Aug 14 '20 at 05:36
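A minimal sketch of the temp-location-plus-rename workaround described in the comments above (the helper name and the paths are hypothetical):

import os
import shutil

def deliver(src, watched_dir):
    # hypothetical helper: stage the copy under a dot-prefixed temporary
    # name on the destination filesystem, so the listener never sees a
    # half-written file under its final name
    tmp = os.path.join(watched_dir, '.incoming-' + os.path.basename(src))
    shutil.copy2(src, tmp)
    # rename(2) within a single filesystem is atomic: the file appears
    # in the watched folder only once it is fully written
    os.rename(tmp, os.path.join(watched_dir, os.path.basename(src)))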

2 Answers


For a conceptual explanation of atomic and cross-filesystem moves, refer to the article on atomic cross-filesystem moves in Python linked in the comments above (it can really save you time).

You can take any of the following approaches to deal with your problem:

-> Monitor filesystem events with pyinotify.
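A minimal sketch with pyinotify, assuming the watched folder is /path/to/watch (a hypothetical path); the IN_CLOSE_WRITE event fires only after the writer has closed the file:

import pyinotify

class DoneHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # fires once the writing process has closed the file
        print('ready for analysis:', event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/path/to/watch', pyinotify.IN_CLOSE_WRITE)
pyinotify.Notifier(wm, DoneHandler()).loop()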

-> Lock the file for a few seconds using flock.
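A sketch using Python's fcntl module. Note that flock locks are advisory, so this only works if the process writing the files takes the same lock, which is an assumption here:

import fcntl

with open(filename, 'rb') as f:
    try:
        # non-blocking attempt: raises immediately if the writer
        # still holds an exclusive lock on the file
        fcntl.flock(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
        # safe to analyze the file here
        fcntl.flock(f, fcntl.LOCK_UN)
    except BlockingIOError:
        # the writer is still busy; try again later
        pass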

-> Use lsof to check which processes are currently using a particular file:

from subprocess import check_output, Popen, PIPE, CalledProcessError

try:
    # lsof lists any processes that currently have the file open;
    # grep succeeds only if the file appears in that output
    lsout = Popen(['lsof', filename], stdout=PIPE)
    check_output(['grep', filename], stdin=lsout.stdout)
except CalledProcessError:
    # check_output raises CalledProcessError when grep finds no
    # process using the file, so the file is free at this point
    pass

Just write your file-processing code in the except block and you are good to go.

-> Run a daemon that monitors the parent folder for any changes, e.g., using the watchdog library.
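A minimal sketch with watchdog, assuming finished files are renamed into /path/to/watch (a hypothetical path), as in the rename workaround from the comments:

import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class NewFileHandler(FileSystemEventHandler):
    def on_moved(self, event):
        # fires when a finished file is renamed into the folder
        print('ready for analysis:', event.dest_path)

observer = Observer()
observer.schedule(NewFileHandler(), '/path/to/watch')
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()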

-> Check whether the file is being used by another process by looping through the PIDs in /proc (assuming you have control over the program that is continuously adding the new files, so you can identify its PID).
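A sketch of the /proc approach: walk each process's fd directory and check whether any descriptor points at the file. This is Linux-specific, and entries belonging to other users need root to read:

import os

def pids_using(path):
    target = os.path.realpath(path)
    users = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            fds = os.listdir(fd_dir)
        except (PermissionError, FileNotFoundError):
            continue  # no permission, or the process already exited
        for fd in fds:
            # each entry is a symlink to the open file
            if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                users.append(int(pid))
                break
    return users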

-> Check whether a file has an open handle on it using psutil.
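A sketch with psutil; ad_value=[] substitutes an empty list for processes whose open files we are not allowed to inspect:

import psutil

def file_in_use(path):
    for proc in psutil.process_iter(['open_files'], ad_value=[]):
        for f in proc.info['open_files'] or []:
            if f.path == path:
                return True
    return False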

Jatin Mehrotra

In programming, this is called concurrency: computations happen simultaneously and the order of execution is not guaranteed. In your case, one program begins to read a file before another program has finished writing to it. This particular situation is called the readers-writers problem and is actually fairly common in embedded systems.

There are a number of solutions to this problem, but the simplest and most common is a lock. A lock protects a resource from being accessed by more than one program at the same time; in effect, it makes sure that operations on the resource happen atomically. A lock is implemented as an object that can be acquired and released (these are usually methods of the object). A program repeatedly tries to acquire the lock until it succeeds; holding the lock grants it the right to execute some block of code, after which it releases the lock. Note that what I am referring to as a program is typically called a thread.

In Python, you can use the threading.Lock class. First, create a Lock object:

from threading import Lock
file_lock = Lock()

Then in each thread, wait to acquire the lock before proceeding. If you set blocking=True, it will cause the entire thread to stop running until the lock is acquired, without requiring a loop.

file_lock.acquire(blocking=True)
# atomic operation: read or write the file here
file_lock.release()

Note that the same Lock object must be shared by every thread. You will need to acquire the lock before reading or writing the file, and release it afterwards. That makes sure those operations do not happen at the same time.
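A minimal sketch of the pattern with two threads sharing the same lock (the file name data.txt is hypothetical). Using the lock as a context manager is equivalent to the acquire/release pair above, and it releases the lock even if an exception occurs:

from threading import Lock, Thread

file_lock = Lock()

def worker(tag):
    # only one thread can hold the lock, so only one
    # thread touches the file at a time
    with file_lock:
        with open('data.txt', 'a') as f:
            f.write('line from %s\n' % tag)

threads = [Thread(target=worker, args=(t,)) for t in ('mover', 'analyzer')]
for t in threads:
    t.start()
for t in threads:
    t.join()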

tk744
  • I appreciate all the explanation. It helps a lot. I am having a bit of trouble with the documentation. (A) Do I make a new lock for each file? (B) Am I understanding correctly that this solution requires I have two threads in one python program (a file mover thread and a file analyzer thread) instead of two separate programs? (C) Also, can this work with python's multiprocessing, because I'd like to use its Queue feature? Thanks again! – rfii Aug 14 '20 at 05:07