I want to build a library of tens of thousands of files with Node.js, stored in a database (SQLite or something similar), much like Plex does for videos. The files will be available to the Node.js server locally or through a NAS. After a file is processed, information about the file (and its location) is stored in the database. I want a scan feature that scans a given directory (and its subdirectories) for files and skips the files that have already been processed. What is the best way to keep track of which files have already been processed? It needs to work for several tens of thousands of files. A couple of ideas I have:
- Use a file watcher like `fs.watch` or `chokidar`. Downside is that this watcher always needs to run in order to detect new files and will not work backwards when the server is down.
- Cron job to go over files and move the files to a new directory when they are processed (prefer a solution where I do not need to move the files)
- Based on content hash: hash the content of each processed file, store the hash, and check whether the hash of a new file is already in the DB (would require a DB call for each file, and every file has to be read and hashed in full, making performance bad)
- Based on just filenames: get all processed filenames from the DB, loop over all files, and check whether each one is in the list of filenames already processed. Performance would probably be bad when there are a lot of files (both going over that many files and holding all processed filenames from the DB in an object, making memory the bottleneck).
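To make the last two ideas concrete, here is a rough sketch of what I have in mind, not a tested implementation. The names `walk`, `findUnprocessed`, `hashFile`, and `processedPaths` are made up for illustration, and `processedPaths` is assumed to be a `Set` of file paths loaded from the database beforehand. SHA-256 is just one possible choice of hash.

```javascript
const fs = require('node:fs/promises');
const { createReadStream } = require('node:fs');
const path = require('node:path');
const crypto = require('node:crypto');

// Recursively yield the path of every regular file under `dir`
// (paths are built by joining onto whatever `dir` was passed in).
async function* walk(dir) {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      yield* walk(fullPath);
    } else if (entry.isFile()) {
      yield fullPath;
    }
  }
}

// Filename-based idea: return the files under `rootDir` whose path is
// not in `processedPaths` (hypothetical Set of paths read from the DB).
async function findUnprocessed(rootDir, processedPaths) {
  const pending = [];
  for await (const filePath of walk(rootDir)) {
    if (!processedPaths.has(filePath)) {
      pending.push(filePath);
    }
  }
  return pending;
}

// Content-hash idea: stream the file through SHA-256 so the whole file
// is never held in memory at once; resolves with the hex digest.
function hashFile(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    createReadStream(filePath)
      .on('error', reject)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')));
  });
}
```

With this, a rescan would call something like `findUnprocessed('/path/to/library', processedPaths)` and only process (or hash) the files it returns.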
All of the above scenarios have performance issues and probably won't work when there are many files to check. The only performant solution I can think of is grabbing 10 or so files at a time from a `needs-processing` directory and moving them to a `processed` directory, but I would like a performant solution where I don't have to move the files. I want a single folder where I can upload all the files, and when I upload new files, either the library periodically checks for new files or I trigger a "rescan library" action to check for them.