
I have the following code to work with my files. If a file has already been added to the database, it is simply skipped (pass).

import sqlite3 as lite

for i, file in enumerate(allMedia):
    con = lite.connect(DB_PATH)
    con.text_factory = str
    with con:
        cur = con.cursor()
        cur.execute("SELECT rowid,files_id,path,set_id,md5,tagged FROM files WHERE path = ?", (file,))
        row = cur.fetchone()
        if row is not None:
            pass

The problem with this code is that processing is slow (2-3 seconds for each file found in the database). The database size is ~30 Mb, and thousands of files need to be processed.

Is there any way to speed up the process?

LA_
  • There are _lots_ of things you might want to do here but you've given the context for almost none of them. Why do you create the connection _inside_ the `for` loop? – roganjosh Aug 08 '23 at 13:38
  • *"The average file size is about 3-4 Mb."*: I don't see any file reading in this code, so how is that relevant? It looks like you only work with file names, not with the file's contents. What am I missing? – trincot Aug 08 '23 at 13:45
  • +1. Create the connection outside the for loop. Maybe the cursor too. Moreover, I am confused by the 3 last lines. You fetchnone and you check if you get something ? Isn't that... strange ? – Itération 122442 Aug 08 '23 at 13:46
  • @Itération122442 they're just checking that the record doesn't exist. Presumably the meat of the actual query is not shown. There's a) no reason to be selecting all these fields to check if it exits, b) no reason to open a connection in a loop and c) no reason not to just do a bulk query and compare somehow (whether in SQL or outside of it). The question is poor. – roganjosh Aug 08 '23 at 13:48
  • @roganjosh, thanks, will move the `connect` call out of the loop. @trincot, you are right, I am checking the file path here only (will edit the question). – LA_ Aug 08 '23 at 13:48
  • @roganjosh, could you please clarify option (c) in details? – LA_ Aug 08 '23 at 13:51
  • @Itération122442 they `fetchone` not `fetchnone`, does `fetchnone` even exist? – Daviid Aug 08 '23 at 14:00
  • Well. I need to learn to read... – Itération 122442 Aug 08 '23 at 14:17
  • Your question is unclear to me. I presume `allMedia` is a list of filenames? How many filenames? You say something is 30Mb (megabits) but I presume you mean 30MB (megabytes)? Is each file 30MB? Or is the combined size of all the files 30MB? You say that processing a file takes 2-3s but you don't show what the processing is. You give no indication of your OS, nor of how many CPU cores you have, when it seems likely parallelisation might be useful.... – Mark Setchell Aug 08 '23 at 14:52
  • @Hermann12 do you just spout random SQL advice at every question? What does it have to do with _anything_ here? Particularly `commit()` on a `SELECT` – roganjosh Aug 08 '23 at 16:20

1 Answer


If you are only interested in the records of the table where the file path matches, you can do this with one SQL query -- one that uses the `in` operator. A single bulk query avoids the overhead of preparing and executing a separate query for every file. You then need to prepare a comma-separated list of `?` placeholders:

con = lite.connect(DB_PATH)
con.text_factory = str
cur = con.cursor()
# Build one "?" placeholder per path in allMedia, e.g. "?,?,?"
anchors = (",?" * len(allMedia))[1:]
sql = f"SELECT * FROM files WHERE path in ({anchors})"
result = cur.execute(sql, allMedia)

for row in result.fetchall():
    # do something with row
    pass
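Note that `fetchall()` loads all matching rows into memory at once; the cursor returned by `execute` is itself iterable, so `for row in result:` also works and streams the rows instead.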

If you have lists (allMedia) that are longer than the number of parameters SQLite allows in a single statement (999 by default in older SQLite builds), then first split your list into chunks, and perform the above query for each chunk.

# From the linked Q&A
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

con = lite.connect(DB_PATH)
con.text_factory = str
cur = con.cursor()
for media in chunks(allMedia, 500):
    anchors = (",?" * len(media))[1:]
    sql = f"SELECT * FROM files WHERE path in ({anchors})"
    result = cur.execute(sql, media)
    for row in result.fetchall():
        # do something with row
        pass
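
As Homer512 notes in a comment below, on Python 3.12 or newer you can drop the hand-rolled `chunks` helper in favour of the standard library's `itertools.batched`. A minimal sketch (note that `batched` yields tuples, which `sqlite3` accepts as parameter sequences):

from itertools import batched  # Python 3.12+

con = lite.connect(DB_PATH)
con.text_factory = str
cur = con.cursor()
for media in batched(allMedia, 500):
    anchors = (",?" * len(media))[1:]
    sql = f"SELECT * FROM files WHERE path in ({anchors})"
    for row in cur.execute(sql, media):
        # do something with row
        pass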

If your `allMedia` list is likely to have duplicates, and you don't need to process the same file twice, then first eliminate these duplicates from the collection being iterated:

for media in chunks(list(set(allMedia)), 500):
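
To tie this back to the skip-if-already-present logic in the question (the "bulk query and compare" idea from the comments), one way is to collect the paths that are already in the table into a set, then test membership per file. This is a sketch, assuming `path` values are stored exactly as the strings appear in `allMedia`:

con = lite.connect(DB_PATH)
con.text_factory = str
cur = con.cursor()

# Gather all paths already present in the files table, 500 at a time
known = set()
for media in chunks(list(set(allMedia)), 500):
    anchors = (",?" * len(media))[1:]
    cur.execute(f"SELECT path FROM files WHERE path in ({anchors})", media)
    known.update(row[0] for row in cur.fetchall())

for file in allMedia:
    if file in known:
        continue  # already in the database: skip, as in the original code
    # process the new file here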
trincot
  • This helps save me some typing in my comments. NB: `IN` is limited to 1000 records in SQLite. It's a quirk of that dialect. – roganjosh Aug 08 '23 at 14:00
  • Oh right, well then the `allMedia` list needs to be chunked. – trincot Aug 08 '23 at 14:01
  • Sorry, not records, parameters. You cannot have more than 999 items in your `IN` clause - `anchors`. I misspoke then, sorry – roganjosh Aug 08 '23 at 14:02
  • 1
    Yes, I understood. But `anchors` is derived from `allMedia`, so splitting `allMedia` in chunks will do the trick. Extended the answer. – trincot Aug 08 '23 at 14:03
  • 1
    BTW: Starting with python-3.12, you can use [`itertools.batched`](https://docs.python.org/3.12/library/itertools.html#itertools.batched) for this – Homer512 Aug 08 '23 at 14:21