
I have a very large folder of files and I would like to create a db of the file names, paths, and formats for easy access.

Firstly, is TinyDB a suitable solution for this db? I really want to keep it small and localised. Speed is not vital, but would be nice... obviously. It would contain roughly 500,000 rows.

Secondly, I can't find any good information on an efficient way to complete this task. I have searched, but the results are either for Python 2, or the comments on them allude to some other, more efficient way of doing things without going into detail. The top-level folder contains a lot of folders, each containing multiple files, so efficiency in this use case would be ideal.
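For what it's worth, the "more efficient way" those comments seem to hint at is usually os.scandir (which os.walk has been built on since Python 3.5), so I'm not sure the walking itself is the problem. Here's a rough sketch of a scandir-based walk that also pulls the extension out of the filename (the iter_files name is just mine, and I haven't benchmarked it):

import os

def iter_files(root):
    # Recursively yield (name, full path, extension) for every file under root.
    # DirEntry objects cache file-type info, so is_dir()/is_file() usually avoid extra stat calls.
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                yield from iter_files(entry.path)
            elif entry.is_file():
                # splitext gives e.g. ('report', '.txt'); strip the dot for the format field
                yield entry.name, entry.path, os.path.splitext(entry.name)[1].lstrip('.')

for name, path, fmt in iter_files("test-folder"):
    print(name, path, fmt)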

I am able to code everything, but I am just not sure how to do the searching part given the size of the task.

This is what I have so far:

import os
import time
from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

# CachingMiddleware buffers writes in memory instead of rewriting db.json on every insert
db = TinyDB('db.json', storage=CachingMiddleware(JSONStorage))

start = time.time()

for path, dirs, files in os.walk("test-folder"):
    for file in files:
        # 'type' is hard-coded to 'txt' for now
        db.insert({'path': os.path.join(path, file), 'name': file, 'type': 'txt'})

count = len(db)  # read the count before closing the database
db.close()       # flushes the cached writes to disk

end = time.time()
print("parsed " + str(count) + " files in " + str((end - start) * 1000) + "ms")
# parsed 2452 files in 2846.367120742798ms
# 400kb db

This code works OK for smaller dbs (a few thousand rows) but gets slow quickly.
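One obvious change would be to collect the records first and write them in a single call with TinyDB's insert_multiple, so the storage is only touched once rather than per file. A rough sketch (I haven't measured this against the full dataset):

import os
from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

db = TinyDB('db.json', storage=CachingMiddleware(JSONStorage))

# Collect all records in memory first
records = []
for path, dirs, files in os.walk("test-folder"):
    for file in files:
        records.append({'path': os.path.join(path, file), 'name': file, 'type': 'txt'})

# Single bulk write instead of one insert per file
db.insert_multiple(records)
db.close()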

In the end I went with the sqlite3 code below, which seems to just work. I don't know how it would handle folder structures even larger than my full dataset, but it didn't seem to take up much RAM and worked relatively quickly (for the full dataset it averaged about 1.3ms per file):

import os
import sqlite3
import time

start = time.time()

con = sqlite3.connect('my-db.db')
cur = con.cursor()

# Recreate the table on each run
cur.execute('DROP TABLE IF EXISTS files')
cur.execute('''CREATE TABLE files
               (filename TEXT, path TEXT, format TEXT)''')

for path, dirs, files in os.walk("my-folder"):
    for file in files:
        # format is hard-coded to 'txt' for now
        cur.execute("INSERT INTO files VALUES (?, ?, ?)",
                    (file, os.path.join(path, file), 'txt'))

# One commit at the end, so all inserts go through in a single transaction
con.commit()
end = time.time()

cur = con.execute('SELECT * FROM files;')
print("parsed " + str(len(cur.fetchall())) + " files in " + str((end - start) * 1000) + "ms")
# parsed 72817 files in 82751.31034851074ms
# size 8116 KB

con.close()

I have done what I needed to this time, but my question still stands.

How can I improve efficiency further?

  • Have you tried using the `executemany()` command, instead of the `execute()` command that you are using? You will have to combine a series of records in a list first to use it and so it will be a bit heavier on memory, but you could e.g. use blocks of 5,000 or so records each time and then reuse the list? https://stackoverflow.com/questions/43785569/for-loop-or-executemany-python-and-sqlite3 You could also use Python's `glob.glob()` command to get the file listing. That might be faster than `os.walk()`. – robbo Dec 25 '21 at 05:03
  • Thanks for the suggestions @robbo, I will have a go with these (rough sketch below) :) – yuuuu Dec 27 '21 at 18:06
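Following robbo's suggestion, this is roughly what I plan to try: batching the inserts with cursor.executemany() in blocks of 5,000, reusing the same list between blocks, and pulling the format out of the filename instead of hard-coding it. Just a sketch for now, not benchmarked against the full dataset:

import os
import sqlite3

con = sqlite3.connect('my-db.db')
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS files')
cur.execute('CREATE TABLE files (filename TEXT, path TEXT, format TEXT)')

BATCH_SIZE = 5000  # block size suggested in the comments
batch = []

for path, dirs, files in os.walk("my-folder"):
    for file in files:
        # splitext gives e.g. ('report', '.txt'); strip the dot for the format column
        batch.append((file, os.path.join(path, file), os.path.splitext(file)[1].lstrip('.')))
        if len(batch) >= BATCH_SIZE:
            cur.executemany("INSERT INTO files VALUES (?, ?, ?)", batch)
            batch.clear()  # reuse the same list for the next block

# flush the final partial block
if batch:
    cur.executemany("INSERT INTO files VALUES (?, ?, ?)", batch)

con.commit()
con.close()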
