5

Recently I faced a performance problem with mp4 file retention. I have a kind of recorder which saves 1-minute-long mp4 files from multiple RTSP streams. Those files are stored on an external drive in a file tree like this:

./recordings/{camera_name}/{YYYY-MM-DD}/{HH-MM}.mp4

Apart from the video files, there are many other files on this drive; they are not considered for retention (unless they have the .mp4 extension), as they take up much less space.

The file retention works as follows. Every minute, the Python script responsible for recording checks the external drive's usage level. If the level is above 80%, it performs a scan of the whole drive and looks for .mp4 files. When the scan is done, it sorts the list of files by creation date and deletes a number of the oldest files equal to the number of cameras.

The part of the code responsible for file retention is shown below.

total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        oldest_files = sorted(
            (
                os.path.join(dirname, filename)
                for dirname, dirnames, filenames in os.walk('/home')
                for filename in filenames
                if filename.endswith(".mp4")
            ),
            key=lambda fn: os.stat(fn).st_mtime,
        )[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file in oldest_files:
            os.remove(oldest_file)
            logging.info("%s removed", oldest_file)
    except ValueError as e:
        # no files to delete
        pass

(/home is the external drive's mount point)

The problem is that this mechanism used to work like a charm when I used a 256 or 512 GB SSD. Now I need more space (more cameras and longer storage time), and creating the file list takes a lot of time on a larger SSD (2 to 5 TB now, and maybe 8 TB in the future). The scanning process takes much longer than 1 min, which could be addressed by performing it less often and extending the list of files to delete. The real problem is that the process itself generates a lot of CPU load (through I/O operations). The performance drop is visible across the whole system: other applications, like some simple computer vision algorithms, run slower, and the CPU load can even cause a kernel panic.

The hardware I work on is an Nvidia Jetson Nano and a Xavier NX. Both devices have the performance problem I described above.

The question is whether you know of any algorithms or out-of-the-box software for file retention that would work for the case I described. Or maybe there is a way to rewrite my code to make it more reliable and performant?
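For illustration only (this is not from the original post): one possible rewrite keeps the traversal inside os.scandir(), so the symlink/type check usually comes from the cached directory entry and each file is stat'ed at most once, and it uses heapq.nsmallest() instead of sorting the whole list. The root path, the extensions, and k are placeholder values.

import heapq
import os

def oldest_media(root="/home/recordings", exts=(".mp4", ".jpg"), k=5):
    """Collect (mtime, path) for media files under `root` and return the k oldest paths."""
    found = []
    stack = [root]
    while stack:
        try:
            with os.scandir(stack.pop()) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.name.endswith(exts) and not entry.is_symlink():
                        # DirEntry caches the stat result, so each file is stat'ed at most once
                        found.append((entry.stat().st_mtime, entry.path))
        except OSError:
            continue  # directory disappeared or is unreadable; skip it
    # partial sort: only the k smallest mtimes (oldest files) are needed
    return [path for _, path in heapq.nsmallest(k, found)]

On Linux each file still costs one stat system call, so the gain over os.walk() plus a per-file os.stat() in the sort key is modest; narrowing the scanned paths, as done in the edit below, remains the bigger win.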

EDIT:

I was able to lower the os.walk() impact by limiting the space to check. Now I just scan /home/recordings and /home/recognition/, which also shrinks the directory tree for the recursive scan. At the same time, I've added .jpg file checking, so now I look for both .mp4 and .jpg. The result is much better with this implementation.

However, I need further optimization. I prepared some test cases and ran them on a 1 TB drive that is 80% full (mostly media files). I attached the profiler results per case below.

@time_measure
def method6():
    paths = [
        "/home/recordings",
        "/home/recognition",
        "/home/recognition/marked_frames",
    ]
    files = []
    for path in paths:
        files.extend((
            os.path.join(dirname, filename)
            for dirname, dirnames, filenames in os.walk(path)
            for filename in filenames
            if (filename.endswith(".mp4") or filename.endswith(".jpg")) and not os.path.islink(os.path.join(dirname, filename))
        ))
    oldest_files = sorted(
        files,
        key=lambda fn: os.stat(fn).st_mtime,
    )
    print(oldest_files[:5])

(profiler result screenshot for method6)

@time_measure
def method7():
    ext = [".mp4", ".jpg"]
    paths = [
        "/home/recordings/*/*/*",
        "/home/recognition/*",
        "/home/recognition/marked_frames/*",
    ]
    files = []
    for path in paths:
        files.extend((file for file in glob(path) if not os.path.islink(file) and (file.endswith(".mp4") or file.endswith(".jpg"))))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    print(oldest_files[:5])

(profiler result screenshot for method7)

The original implementation took ~100 s on the same data set.
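The @time_measure decorator used above is not shown in the post; a minimal stand-in that matches the "Took ... s" output seen in the results below might look like this (the name and output format are assumptions):

import time
from functools import wraps

def time_measure(func):
    # Hypothetical stand-in for the decorator used by the test methods above.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"Took {time.perf_counter() - start} s")
        return result
    return wrapper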

EDIT 2:

Comparison of @norok2's proposals

I compared them with method6 and method7 from above. I tried several times with similar results.

Testing method7
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 24.73726773262024 s
_________________________
Testing find_oldest
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 34.355509757995605 s
_________________________
Testing find_oldest_cython
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 25.81963086128235 s

(profiler screenshots: method7 with glob(), the iglob() variant, and the Cython variant)

przemoch
  • Crazy idea: can you change the naming convention of your files to include the creation timestamp, so that sorting them by name automatically sorts them by creation time? This way you can avoid the lambda execution. – Rodrigo Rodrigues Jun 18 '22 at 17:22
  • Or, can you manage the directory structure in a way that lets you decide what to purge based solely on the folder name? This way, you can avoid the os.walk over the whole /home at file level. – Rodrigo Rodrigues Jun 18 '22 at 17:24
  • I am unfamiliar with Jetson, but maybe you can let the OS (if it has one) run a `find` command under `nice` in a Python `subprocess()` so it runs at a lower priority and doesn't interfere with other operations. – Mark Setchell Jun 24 '22 at 09:58
  • Maybe you can alter the code that makes recordings so that it additionally adds a record of the filename/date into a little `sqlite` database each time you make a recording. Then you won't need to bother the filesystem when you want to find deletion candidates; just do an SQL query instead (a sketch of this idea follows these comments). – Mark Setchell Jun 24 '22 at 10:01
  • I assume that `os.walk()` is the main bottleneck, but you can probably speed up your code by partitioning instead of sorting, as that would require fewer calls to `os.stat()`. Have you profiled it? – StefOverflow Jun 25 '22 at 08:25
  • @StefOverflow Yes, I agree that os.walk() is the main problem. I was able to lower its impact by restricting the path from the whole disk to just the directory that contains the main recording files (I'll handle the rest of the files separately). Now I get a much better result, but I still need further optimization. See my edited question. – przemoch Jun 25 '22 at 13:52
  • Can you report your imports? `glob` is the name of a module, but you seem to have used it differently. I'd say `glob.iglob` is going to be faster than `glob.glob`. – norok2 Jun 25 '22 at 14:07
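A rough sketch of the sqlite suggestion from the comment above, assuming the recorder itself can be modified; the database path, table, and column names are made up for illustration:

import os
import sqlite3

DB_PATH = "/home/recordings/index.db"  # hypothetical location for the index

def init_db():
    con = sqlite3.connect(DB_PATH)
    con.execute("CREATE TABLE IF NOT EXISTS recordings (path TEXT PRIMARY KEY, mtime REAL)")
    return con

def register(con, path):
    # called by the recorder right after a file has been written
    con.execute("INSERT OR REPLACE INTO recordings VALUES (?, ?)", (path, os.stat(path).st_mtime))
    con.commit()

def oldest(con, k):
    # deletion candidates come from the index, not from a filesystem walk
    rows = con.execute("SELECT path FROM recordings ORDER BY mtime LIMIT ?", (k,)).fetchall()
    return [path for (path,) in rows]

def forget(con, path):
    # keep the index in sync after os.remove(path)
    con.execute("DELETE FROM recordings WHERE path = ?", (path,))
    con.commit()

With such an index, the retention check becomes a single ORDER BY mtime LIMIT query instead of a full directory scan.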

3 Answers

1

You could get an extra few percent speed-up on top of your method7() with the following:

import os
import glob


def find_oldest(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = [      
        filename
        for path in paths
        for filename in glob.iglob(path)
        if any(filename.endswith(ext) for ext in exts) and not os.path.islink(filename)]
    mtime_idxs = sorted(
        (os.stat(fn).st_mtime, i)
        for i, fn in enumerate(result))
    return [result[mtime_idxs[i][1]] for i in range(k)]

The main improvements are:

  • use iglob() instead of glob() -- while it may be of comparable speed, it takes significantly less memory, which may help on low-end machines
  • str.endswith() is done before the allegedly more expensive os.path.islink(), which helps reduce the number of such calls due to short-circuiting
  • an intermediate list with all the mtimes is produced to minimize the os.stat() calls

This can be sped up even further with Cython:

%%cython --cplus -c-O3 -c-march=native -a

import os
import glob


cpdef find_oldest_cy(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = []
    for path in paths:
        for filename in glob.iglob(path):
            good_ext = False
            for ext in exts:
                if filename.endswith(ext):
                    good_ext = True
                    break
            if good_ext and not os.path.islink(filename):
                result.append(filename)
    mtime_idxs = []
    for i, fn in enumerate(result):
        mtime_idxs.append((os.stat(fn).st_mtime, i))
    mtime_idxs.sort()
    return [result[mtime_idxs[i][1]] for i in range(k)]

My tests on the following files:

def gen_files(n, exts=("mp4", "jpg", "txt"), filename="somefile", content="content"):
    for i in range(n):
        ext = exts[i % len(exts)]
        with open(f"{filename}{i}.{ext}", "w") as f:
            f.write(content)


gen_files(10_000)

produce the following:

funcs = find_oldest_OP, find_oldest, find_oldest_cy


timings = []
base = funcs[0]()
for func in funcs:
    res = func()
    is_good = base == res
    timed = %timeit -r 8 -n 4 -q -o func()
    timing = timed.best * 1e3
    timings.append(timing if is_good else None)
    print(f"{func.__name__:>24}  {is_good}  {timing:10.3f} ms")
#           find_oldest_OP  True      81.074 ms
#              find_oldest  True      70.994 ms
#           find_oldest_cy  True      64.335 ms

find_oldest_OP is the following, based on method7() from OP:

def find_oldest_OP(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    files = []
    for path in paths:
        files.extend(
            (file for file in glob.glob(path)
            if not os.path.islink(file) and any(file.endswith(ext) for ext in exts)))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    return oldest_files[:k]

The Cython version seems to point to a roughly 20% reduction in execution time.

norok2
  • I don't know why, but the method with `iglob()` you proposed takes a bit longer than the `glob()` one (method7): 23.3 s vs 31.3 s. I tested it several times. – przemoch Jun 25 '22 at 19:22
  • The `Cython` version takes almost the same time as method7. I put the profiler output in the question edit. – przemoch Jun 25 '22 at 19:23
  • @przemoch Clearly, you should try whatever works best with your system. I would try getting the files with `find`. – norok2 Jun 26 '22 at 12:18
0

You could use the subprocess module to list all the mp4 files directly, without having to loop through all the files in the directory.

import os
import subprocess as sb
files = sb.getoutput(r"dir /b /s .\home\*.mp4").split("\n")  # Windows `dir`; see the comments below
oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)[:len(camera_devices)]
Utshaan
  • This is actually a good idea, perhaps you could elaborate further? – StefOverflow Jun 25 '22 at 08:21
  • Well, it's given that it's using the Windows cmd; I'm sure Linux/macOS have a similar parallel to the `dir /b /s .\home\*.mp4` command. It's basically just running the built-in file-listing command: the /b switch makes sure it's in bare format (just the files/folders and nothing else), and /s searches all directories inside the particular folder. '.\home\*.mp4' is the part where you feed in the directory to search and a wildcard for mp4. As this is a native command and doesn't require one to loop through the files again in Python, it cuts down on execution time (see the sketch below for one possible Linux parallel). – Utshaan Jun 25 '22 at 08:54
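A sketch of what such a Linux parallel might look like, using GNU find's -printf to print the modification time next to each path so that Python never needs to call os.stat(); the root path and k are placeholders, -printf is GNU-specific, and the command could also be run under nice/ionice as suggested in an earlier comment:

import subprocess

def oldest_with_find(root="/home/recordings", k=5):
    # GNU find prints "<epoch mtime> <path>" for every regular .mp4/.jpg file;
    # -type f already excludes symlinks.
    out = subprocess.run(
        ["find", root, "-type", "f",
         "(", "-name", "*.mp4", "-o", "-name", "*.jpg", ")",
         "-printf", "%T@ %p\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs = (line.split(" ", 1) for line in out.splitlines() if line)
    return [path for _, path in sorted((float(ts), path) for ts, path in pairs)[:k]]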
0

A quick optimization would be to not bother checking the file creation time at all and to trust the filename instead.

total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        files = []
        for dirname, dirnames, filenames in os.walk('/home/recordings'):
            for filename in filenames:
                files.append((
                    name := os.path.join(dirname, filename),
                    datetime.strptime(
                        re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)[0],
                        "%Y-%m-%d/%H-%M"
                        )))
        files.sort(key=lambda e: e[1])
        oldest_files = files[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file, _ in oldest_files:
            os.remove(oldest_file)
            # logging.info("%s removed", oldest_file)
        logging.info("Removed")
    except (ValueError, TypeError):
        # no files to delete, or a filename that does not match the date pattern
        pass
IamFr0ssT