
I want to make a library of tens of thousands of files with node.js, stored in a database (sqlite or something), similar to how Plex does it for videos. The files will be locally available to the node.js server or through a NAS or something. After a file is processed, information about the file (and its location) is stored in a database. I want to make a scan feature that can scan a certain directory (and its subdirectories) for files, and skip the files that have already been processed before. What is the best way to keep track of which files are already processed? It needs to work for several tens of thousands of files. A couple of ideas I have:

  • Use a file watcher like fs.watch or chokidar. The downside is that this watcher always needs to be running to detect new files, and it won't pick up files that were added while the server was down.
  • A cron job that goes over the files and moves them to a new directory once they are processed (I would prefer a solution where I do not need to move the files).
  • Based on content hash: hash the content of each processed file, store the hash, and check whether the hash of a new file is already in the DB (this would require a DB call for each file, and the full content has to be read and hashed for every file, making performance bad).
  • Based on just filenames: get all processed filenames from the DB, loop over all files, and check whether each is in the list of already-processed filenames. Performance would probably be bad when there are a lot of files (both going over that many files and keeping all processed filenames from the DB in an object, making memory the bottleneck).

All of the above scenarios have performance issues and probably won't work when there are many files to check. The only performant solution I can think of is grabbing 10 or so files at a time from a needs-processing directory and moving them to a processed directory, but I would like a performant solution where I don't have to move the files. I want a single folder where I can upload all the files, and when I upload new files the library either picks them up periodically or I trigger a rescan to check for new files.
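
For context, this is roughly how I picture such a rescan working with a per-file path lookup (just a sketch, assuming better-sqlite3 and a `processed_files` table with an indexed `path` column; `processFile` is a placeholder for the actual processing):

```js
// Sketch of a "rescan library" pass: walk the directory tree and skip
// anything whose path is already recorded in the processed_files table.
// (Assumes better-sqlite3; processFile is a placeholder for the real work.)
const fs = require('fs/promises');
const path = require('path');
const Database = require('better-sqlite3');

const db = new Database('library.db');
const isProcessed = db.prepare('SELECT 1 FROM processed_files WHERE path = ?');
const markProcessed = db.prepare('INSERT INTO processed_files (path) VALUES (?)');

async function* walk(dir) {
  for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) yield* walk(full);
    else yield full;
  }
}

async function rescan(rootDir) {
  for await (const file of walk(rootDir)) {
    if (isProcessed.get(file)) continue; // one indexed lookup per file
    await processFile(file);             // placeholder for the real processing
    markProcessed.run(file);
  }
}
```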

Laurens
  • Maybe I need a little more clarification about one thing: you need something to automatically update the DB on new files, but the reason something like `fs.watch` is off the table is because it "always needs to run". I don't think you will be able to get the functionality you want between the filesystem and the db _without_ something like a watcher/daemon waiting and listening (assuming also no cron job). I think either way something around your first proposed solution, or combining the first and second, is the correct direction here. – mik rus Jul 02 '21 at 16:07
  • I am not sure fs.watch would work at scale. Suppose I add another 10k files to my library in one go; this would probably not go well with something like fs.watch. So I would need something that can process these 10k files in batches, but I'm not even sure how in node.js you could grab, say, the first x files, process them, and then grab the next x files. – Laurens Jul 04 '21 at 10:14
  • Why not use log files? Once a file123.txt is processed, create a file123.txt.log file which can be empty (0 KB), or contain the hash of the file if you want to later check whether you have processed the latest version. Then you only need to process files with no .log equivalent, and only re-process if the hash is different, whenever you want. – Herald Smit Jul 06 '21 at 11:47

3 Answers


Store the files themselves in the database rather than just their location; using Filestream is one option. Then add some sort of flag that indicates whether a file has been processed. You can loop over all the files and know which ones still need work; just make sure to update the table once a file is processed. Depending on the processing, you could also limit it to times that are convenient.

For example, if there is a chance a file will not be used but it needs to be processed before use, you can process it right before the call and avoid checking constantly or periodically.
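
A minimal sketch of the flag idea with SQLite (assuming better-sqlite3; the table and column names are only examples):

```js
// Keep the file content and a "processed" flag in the same row,
// then loop over whatever is still unprocessed and flip the flag.
// (Assumes better-sqlite3; the schema is only an example.)
const Database = require('better-sqlite3');
const db = new Database('library.db');

db.exec(`CREATE TABLE IF NOT EXISTS files (
  id        INTEGER PRIMARY KEY,
  name      TEXT,
  content   BLOB,
  processed INTEGER DEFAULT 0
)`);

const pending  = db.prepare('SELECT id, name, content FROM files WHERE processed = 0');
const markDone = db.prepare('UPDATE files SET processed = 1 WHERE id = ?');

for (const row of pending.iterate()) {
  // ... do the actual processing on row.content here ...
  markDone.run(row.id);
}
```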

Performance-wise, this could even be faster than the filesystem in terms of read/write. From the SQLite website:

... many developers are surprised to learn that SQLite can read and write smaller BLOBs (less than about 100KB in size) from its database faster than those same blobs can be read or written as separate files from the filesystem. (See 35% Faster Than The Filesystem and Internal Versus External BLOBs for further information.) There is overhead associated with operating a relational database engine, however one should not assume that direct file I/O is faster than SQLite database I/O, as often it is not.

cmag
  • Maybe I am misunderstanding Filestream, but to me it seems this isn't really a solution for how I can process files at scale (storing them in the database is a form of processing the files). Suppose I add 10k files to my library folder, how would I go about processing all of them and storing them in the database (and why would I even want to store them in the db?) – Laurens Jul 04 '21 at 10:16
  • I don't think databases are supposed to be used as file storage. DB & file storage should be separate unless you have a solid reason to go otherwise. – Arpit Jul 08 '21 at 13:19

Since you are storing file-processing info in the DB, get the last processing time from the DB in a single query and process all the files that were created after that timestamp.

For filtering files by timestamp, see: How to read a file from directory sorting date modified in Node JS
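
Roughly something like this, as a sketch (assuming better-sqlite3 and a `processed_at` column holding epoch milliseconds; adapt to your own schema):

```js
// Sketch: fetch the last processing time once, then only touch files
// whose mtime is newer than that cutoff.
// (Assumes better-sqlite3 and a processed_at column in epoch milliseconds.)
const fs = require('fs/promises');
const path = require('path');
const Database = require('better-sqlite3');

const db = new Database('library.db');

async function scanNewFiles(dir) {
  const { last } = db.prepare('SELECT MAX(processed_at) AS last FROM files').get();
  const cutoff = last || 0;

  for (const name of await fs.readdir(dir)) {
    const full = path.join(dir, name);
    const stat = await fs.stat(full);
    if (stat.mtimeMs <= cutoff) continue; // already covered by an earlier scan
    // ... process the file, then store it with processed_at = Date.now() ...
  }
}
```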

And if you can control the directory structure, then partition your files by datetime and other primary/secondary keys.

Arpit

How about option 5: based on time? If you know the last time you processed the directory was at timestamp x, then the next go around you can skip all files older than x just by looking at the file stats. Then from this smaller subset you can use hashes to look for clashes.
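
A rough sketch of what that single pass could look like (`lastScanTime` comes from wherever you persist it, and `isKnownHash` is a stand-in for your own DB lookup):

```js
// One pass over the directory: skip anything older than the last scan,
// then hash only the remaining candidates and check them against the DB.
// (lastScanTime and isKnownHash are stand-ins for your own persistence.)
const fs = require('fs/promises');
const path = require('path');
const crypto = require('crypto');

async function findNewFiles(dir, lastScanTime) {
  const fresh = [];
  for (const name of await fs.readdir(dir)) {
    const full = path.join(dir, name);
    const stat = await fs.stat(full);
    if (stat.mtimeMs <= lastScanTime) continue; // old file, skip cheaply

    const hash = crypto.createHash('sha1')
      .update(await fs.readFile(full))
      .digest('hex');
    if (!isKnownHash(hash)) fresh.push({ file: full, hash }); // DB check stub
  }
  return fresh;
}
```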

Edit: It seems Arpit and I were typing the same general idea at the same time. Note, though, that the sorting method in the link he included will iterate over all 10k files 3 times. You don't need to sort anything; you just need to iterate through once and process the ones that fit the bill.

leitning