
I'm using a klepto file_archive to index the specs of files in a folder tree. After scanning the tree, I want to quickly remove references to deleted files, but removing items one-by-one from the file archive is extremely slow. Is there a way to sync the changes to the archive, or to delete multiple keys at once? (The sync method appears only to add new items.)

The helpful answer by @Mike McKerns to this question only deals with removing a single item: Python Saving and Editing with Klepto

Using files.sync() or files.dump() appears only to append data from the cache, not to sync the deletes. Is there a way to delete keys from the cache and then sync those changes all at once? Individual deletes are far too slow.
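
To illustrate, here's a minimal repro of that behaviour (the archive name demo.pkl is a throwaway, using klepto's default cached file_archive):

import klepto as kl

ar = kl.archives.file_archive('demo.pkl') #cached=True by default
ar['a'] = 1
ar['b'] = 2
ar.dump()         #the file on disk now holds both keys
del ar['a']       #removes 'a' from the memory cache only
ar.dump()         #re-dumps the cache, but 'a' remains in the file
print(ar.archive) #the disk archive still contains 'a'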

Here's a working example:

from klepto.archives import *
import os

class PathIndex:
    def __init__(self, folder):
        self.folder_path = folder
        #cached=True (the default) keeps a memory cache in front of the file archive
        self.files = file_archive(os.path.join(self.folder_path, '.filespecs'), cached=True)
        self.files.load() #load the file archive into the memory cache

    def list_directory(self):
        self.filelist = []
        for folder, subdirs, filelist in os.walk(self.folder_path): #walk every subfolder of the root
            for filename in filelist: #and every file in each folder/subfolder
                self.filelist.append(os.path.join(folder, filename))

    def scan(self):
        self.list_directory()
        for path in self.filelist:
            self.update_record(path)
        self.files.dump() #save to file archive

    def rescan(self):
        self.list_directory() #rescan the disk
        deletedfiles = []

        #code to check for modified files etc.
        #check for deleted files
        for path in self.files:
            try:
                self.filelist.remove(path) #whatever remains in self.filelist is a new file
            except ValueError:
                deletedfiles.append(path) #indexed, but no longer on disk

        #code to add new files, i.e. the files left in self.filelist
        for path in deletedfiles:
            self.delete_record(path)
        #here I'd like to sync the modified index to disk, all at once

    def update_record(self, path):
        self.files[path] = {'size': os.path.getsize(path), 'modified': os.path.getmtime(path)}
        #add other specs - hash of contents etc.

    def delete_record(self, path):
        del self.files[path] #delete from the memory cache
        #this next line slows it all down
        del self.files.archive[path] #delete from the disk archive

#usage
_index = PathIndex('/path/to/root')
_index.scan()
#delete or modify some files on disk, then:
_index.rescan()
starfish
  • I'm not sure I'm understanding what you want. Do you want to delete multiple keys from the cache with one method, then sync all the deleted keys (so that it removes the associated file entries) in the archive? I believe you could use some combination of `clear` and `load`, or `clear` and `dump`, or `sync(clear=True)`... depending on what you want to do. – Mike McKerns Feb 13 '19 at 23:21
  • If you provide a minimal self-contained example, I can provide a clearer answer -- with an example. – Mike McKerns Feb 14 '19 at 13:40
  • Many thanks - updated with a simplified version of a file indexer. The full version adds more specs, deals also with modified files and new files. And to avoid memory issues with huge directories, on the initial scan dumps to file archive in chunks. – starfish Feb 14 '19 at 21:37

1 Answer


I see... you really are concerned about the speed of deleting one entry at a time from a file_archive.

Ok, I agree. Using __delitem__ or pop on a file_archive is a bit brutal when you want to delete several entries. The slowdown is due to the file_archive having to load and rewrite the entire file archive for each key you delete. This isn't the case for a dir_archive or many of the other archives... but for a file_archive it is. So that should be remedied...
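
For comparison, here's a minimal sketch (the archive name is arbitrary) of why a dir_archive behaves differently: it stores each entry in its own file on disk, so deleting one key only removes that key's file:

>>> import klepto as kl
>>> d = kl.archives.dir_archive('specs') #one file per key on disk
>>> d['a'] = 1
>>> d['b'] = 2
>>> d.dump()
>>> del d.archive['a'] #removes only that key's file; no full rewrite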

UPDATE: I've added a new popkeys method that should enable faster dropping of specified keys...

>>> import klepto as kl
>>> ar = kl.archives.file_archive('foo.pkl')
>>> ar['a'] = 1
>>> ar['b'] = 2
>>> ar['c'] = 3
>>> ar['d'] = 4
>>> ar['e'] = 5
>>> ar.dump()
>>> ar.popkeys(list('abx'), None)
[1, 2, None]
>>> ar.sync(clear=True)
>>> ar
file_archive('foo.pkl', {'c': 3, 'e': 5, 'd': 4}, cached=True)
>>> ar.archive
file_archive('foo.pkl', {'c': 3, 'e': 5, 'd': 4}, cached=False)

Previously (i.e. in released versions), you could cheaply pop the unwanted keys from the local cache, and then do an ar.sync(clear=True) to remove the associated keys from the archive. However, doing that assumes you have all the keys you want to preserve in memory. So, instead of loading all the keys into memory, you can now (at least in the soon-to-be-released version) use popkeys on the cache and/or the archive to delete any unwanted keys from either.
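
For example, a minimal sketch of that released-version approach (a hypothetical pre-populated archive bar.pkl, and assuming everything you want to keep fits in memory):

>>> ar = kl.archives.file_archive('bar.pkl')
>>> ar.load()                 #pull every archived entry into the memory cache
>>> for key in ['a', 'b']:    #the keys you want gone
...     _ = ar.pop(key, None) #cheap: removes from the cache only
...
>>> ar.sync(clear=True)       #one rewrite: the archive now matches the cache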

Mike McKerns
  • Thanks - yes, you got it; speed is the thing. I'll do as you suggest. – starfish Feb 15 '19 at 07:06
  • Great - that should do it thanks. Presumably - in your above - if you do use ar.popkeys() on the archive, then the line below ar.sync(clear=True) is no longer necessary. – starfish Feb 16 '19 at 11:02
  • Correct. The point is, `sync` is a different use case than dropping keys from both cache and archive. – Mike McKerns Feb 16 '19 at 12:30