I have a very large folder of files and I would like to create a db of the file names, path, and format for easy access.
Firstly, is TinyDB a suitable solution for this database? I really want to keep it small and local. Speed is not vital, but it would be nice. It would contain roughly 500,000 rows.
Secondly, I can't find any good information on an efficient way to complete this task. The results I've found are either for Python 2, or the comments on them allude to some other, more efficient way of doing things without going into detail. The top-level folder contains a lot of folders, each containing multiple files, so efficiency in this use case would be ideal.
I am able to code everything, but I am just not sure how to do the searching part given the size of the task.
This is what I have so far:
import os
import time

from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

# CachingMiddleware buffers writes in memory and flushes them to disk periodically
db = TinyDB('db.json', storage=CachingMiddleware(JSONStorage))

start = time.time()
for path, currentDirectory, files in os.walk("test-folder"):
    for file in files:
        db.insert({'path': os.path.join(path, file), 'name': file, 'type': 'txt'})

count = len(db)  # read the count before closing the storage
db.close()
end = time.time()

print("parsed " + str(count) + " files in " + str((end - start) * 1000) + "ms")
# parsed 2452 files in 2846.367120742798ms
# 400kb db
This code works fine for smaller databases (a few thousand rows) but slows down quickly as the row count grows.
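One change I'm considering for the TinyDB version (only a sketch, not tested at the 500,000-row scale) is collecting the records in a list and writing them with a single insert_multiple call, so the table is updated once per batch instead of once per file:

import os
import time

from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

db = TinyDB('db.json', storage=CachingMiddleware(JSONStorage))

start = time.time()
records = []
for path, currentDirectory, files in os.walk("test-folder"):
    for file in files:
        records.append({'path': os.path.join(path, file), 'name': file, 'type': 'txt'})

# One bulk write instead of one insert per file
db.insert_multiple(records)
count = len(db)
db.close()
end = time.time()

print("parsed " + str(count) + " files in " + str((end - start) * 1000) + "ms")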
I ended up with the following code, which just works. I don't know how it would handle folder structures even larger than my full dataset, but it didn't take up much RAM and ran relatively quickly (about 1.3ms per file on the full dataset):
import os
import sqlite3
import time

start = time.time()

con = sqlite3.connect('my-db.db')
cur = con.cursor()

# Recreate the table on each run
cur.execute('DROP TABLE IF EXISTS files')
cur.execute('''CREATE TABLE files
               (filename, filepath, format)''')

for path, currentDirectory, files in os.walk("my-folder"):
    for file in files:
        cur.execute("INSERT INTO files VALUES (?, ?, ?)",
                    (file, os.path.join(path, file), 'txt'))

# Commit once after the walk so all inserts share one transaction
con.commit()
end = time.time()

cur = con.execute('SELECT * FROM files;')
print("parsed " + str(len(cur.fetchall())) + " files in " + str((end - start) * 1000) + "ms")
# parsed 72817 files in 82751.31034851074ms
# size 8116 KB
con.close()
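If I revisit the sqlite3 version, the variation I'd try first (a sketch only, not benchmarked on my dataset; iter_files is just a name I made up) is feeding a generator to cur.executemany, so the inserts are batched by sqlite3 while memory use stays flat:

import os
import sqlite3
import time

def iter_files(root):
    """Yield a (filename, filepath, format) tuple for every file under root."""
    for path, _dirs, files in os.walk(root):
        for file in files:
            yield (file, os.path.join(path, file), 'txt')

start = time.time()
con = sqlite3.connect('my-db.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS files')
cur.execute('CREATE TABLE files (filename, filepath, format)')

# executemany consumes the generator lazily, so the whole file list is never held in memory
cur.executemany("INSERT INTO files VALUES (?, ?, ?)", iter_files("my-folder"))
con.commit()
end = time.time()

count = cur.execute('SELECT COUNT(*) FROM files;').fetchone()[0]
print("parsed " + str(count) + " files in " + str((end - start) * 1000) + "ms")
con.close()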
I have done what I needed to this time, but my question still stands.
How can I improve efficiency further?
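For the "easy access" side of things, I'm also planning to add an index on the column I'll search by (a sketch, assuming lookups are by filename; the index name is my own choice):

import sqlite3

con = sqlite3.connect('my-db.db')
# Hypothetical lookup index; adjust the column if searches are by path instead
con.execute('CREATE INDEX IF NOT EXISTS idx_files_filename ON files (filename)')
con.commit()
con.close()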