
I'm trying to implement an SQLite-based database that can store the full structure of a 100GB folder with a complex substructure (expecting 50-100K files). The main aim of the DB would be to support rapid queries on various aspects of this folder (total size, size of any folder, history of a folder and all its contents, etc.).

However, I realized that finding all the files inside a folder, including all of its sub-folders, is not possible without recursive queries if I just make a "file" table with just a parent_directory field. I consider this one of the most important features I want in my code, so I have considered two schema options, shown in the figure below.

  1. In schema 1, I store all the file names in one table and directory names in another table. Both have a "parentdir" field, but also a text (apparently text/blob are the same in sqlite) field called "FullPath" that saves the entire path from the root to the particular file/directory (like /etc/abc/def/wow/longpath/test.txt). I'm not assuming a maximum subfolder depth, so this could theoretically be a field that allows up to 30K characters. My idea is that if I then want all the files or directories belonging to any parent, I just query this field for the parent's full path and get the FileIDs.

  2. In schema 2, I store only FileNames and FileIDs in the files table, and DirNames and DirIDs in the directories table. But in a third table called "Ancestors", I store for each file one entry per directory that is its ancestor (so in the above example, test.txt will have 5 entries, pointing to the DirIDs of the folders etc, abc, def, wow and longpath respectively). Then if I want the full contents of any folder, I just look up the DirID in this table and get all the FileIDs.

I can see that in schema 1 the main limitation might be searching a variable-length text column, and in schema 2 that I might have to add a ton of entries for files that are buried 100 directories deep or so.

What would be the best of these solutions? Is there any better solution that I did not think of?

Two possible schemas to allow rapid retrieval of *all* the descendants of a directory in a complex directory structure
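To make the second design concrete, here is a minimal runnable sketch of schema 2. The table and column names follow the question; the sample data and everything else is illustrative, not a finished design:

```python
import sqlite3

# Sketch of schema 2: files, directories, and an "Ancestors" table with
# one row per (ancestor directory, file) pair.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE directories (DirID INTEGER PRIMARY KEY, DirName TEXT);
    CREATE TABLE files (FileID INTEGER PRIMARY KEY, FileName TEXT);
    CREATE TABLE ancestors (DirID INTEGER, FileID INTEGER,
                            PRIMARY KEY (DirID, FileID));
""")

# /etc/abc/def/wow/longpath/test.txt: the file gets one Ancestors row
# per directory on its path (DirIDs 1..5 here).
for name in ("etc", "abc", "def", "wow", "longpath"):
    con.execute("INSERT INTO directories (DirName) VALUES (?)", (name,))
con.execute("INSERT INTO files (FileName) VALUES ('test.txt')")
con.executemany("INSERT INTO ancestors VALUES (?, 1)",
                [(d,) for d in range(1, 6)])

# Every file anywhere under directory 2 ("abc") is one indexed lookup,
# no recursion needed.
rows = con.execute("""
    SELECT f.FileName FROM ancestors a
    JOIN files f ON f.FileID = a.FileID
    WHERE a.DirID = 2
""").fetchall()
found = [r[0] for r in rows]
print(found)
```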

user930916
    You might be interested in http://dirtsimple.org/2010/11/simplest-way-to-do-tree-based-queries.html – Dan D. Oct 28 '12 at 04:00
  • Wow that was exactly what I wanted! So the second solution I showed is somewhat similar to what he's describing but he describes also extremely elegant triggers that would keep all the data fully sane without any external code sanitization! I think I will go with that design! – user930916 Oct 28 '12 at 14:56

2 Answers

  1. Your first schema will work just fine. When you put an index on the FullPath column, query it either with the case-sensitive BETWEEN operator, or with LIKE together with either COLLATE NOCASE on the index or PRAGMA case_sensitive_like.

    Please note that this schema also stores all parents, but the IDs (the names) are all concatenated into one value.

    Renaming a directory would require updating all its subtree entries, but you mention history, so it's possible that old entries should stay the same.

  2. Your second schema is essentially the Closure Table mentioned in Dan D.'s comment. Take care not to forget the entries for depth 0.

    This will store lots of data, but being IDs, the values should not be too large.

    (You don't actually need RelationshipID, do you?)

  3. Another choice for storing trees is the nested set model, or the similar nested interval model. The nested set model allows retrieving subtrees by querying for an interval, but updates are hairy. The nested interval model uses fractions, which are not a native data type and therefore cannot be indexed.

I'd estimate that the first alternative would be easiest to use. It should also be no slower than the others if lookups are properly indexed.
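The BETWEEN technique from point 1 can be sketched as follows, using a stripped-down, hypothetical version of schema 1 with just a FullPath column:

```python
import sqlite3

# Reduced sketch of schema 1: an indexed FullPath column. The paths
# below are illustrative sample data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (FileID INTEGER PRIMARY KEY, FullPath TEXT)")
con.execute("CREATE INDEX idx_fullpath ON files(FullPath)")
con.executemany(
    "INSERT INTO files (FullPath) VALUES (?)",
    [("/etc/abc/def/wow/longpath/test.txt",),
     ("/etc/abc/other.txt",),
     ("/var/log/sys.log",)],
)

# All descendants of /etc/abc sort between the prefix itself and the
# prefix with a high sentinel character appended. BETWEEN is
# case-sensitive, so the index is usable without extra pragmas.
prefix = "/etc/abc/"
rows = con.execute(
    "SELECT FullPath FROM files WHERE FullPath BETWEEN ? AND ?",
    (prefix, prefix + "\uffff"),
).fetchall()
found = sorted(r[0] for r in rows)
print(found)
```

The same subtree lookup could instead use `LIKE '/etc/abc/%'`; for the index to be used then, the collation caveats from point 1 apply.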

CL.
  • Thanks for your detailed description of the 3 ideas! I actually think I will go with the Closure Table, it just somehow feels more elegant and how the data should really be in the database (and that DirtSimple link has what looks like really cool triggers that will do all the book-keeping on the database itself, though I will spend some time thinking through that very hard). While the first idea also looks good, I'm just a little wary about doing string manipulations and feel the other option might be cooler. I'll probably run some tests though, space is also a constraint for me – user930916 Oct 28 '12 at 14:58
  • And the RelationshipID had to be added because MS Access won't let me make a table without a primary key and I found it the simplest tool to create Relationship diagrams :) – user930916 Oct 28 '12 at 15:00
  • The 'correct' primary key for that table would be the combination of `DirID` and `FileID`. – CL. Oct 28 '12 at 16:33
  • @CL have you read either Joe Celko's [_Trees and Hierarchies in SQL for Smarties_](http://www.amazon.com/Hierarchies-Smarties-Kaufmann-Management-Systems-ebook/dp/B006Y8MKUU/ref=mt_kindle?_encoding=UTF8&me=) (2012) - totally worth the e-reader price - or ["Managing Hierarchical Data in MySQL"](http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/) (2012). Neither of them make updates seem "hairy" in the nested model. As I am looking at implementing this design, what makes the model hairy to update? Is it just that extra care and thought must be taken? – Thomas Mar 16 '16 at 17:34
  • @Thomas "rearranging the structure of the tree is tricky because figuring out the (lft, rgt) nested sets numbers requires a good bit of algebra in a large tree." It's easier if you have the book. :) – CL. Mar 16 '16 at 18:04

My personal favourite is the visitation number approach, which I think would be especially useful for you since it makes it pretty easy to run aggregate queries against a record and its descendants.
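As a rough illustration (with made-up table and column names), a visitation-number (nested set) table assigns each node lft/rgt numbers from a depth-first walk, so an aggregate over a record and all its descendants becomes a single interval query:

```python
import sqlite3

# Hypothetical nested-set table: lft/rgt come from a depth-first walk,
# and a node's subtree is exactly the rows with lft between the node's
# own lft and rgt.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nodes (name TEXT, lft INT, rgt INT, size INT)")
con.executemany("INSERT INTO nodes VALUES (?,?,?,?)", [
    ("root",  1, 8,  0),
    ("docs",  2, 5,  0),   # child of root
    ("a.txt", 3, 4, 10),   # child of docs
    ("b.txt", 6, 7, 20),   # child of root, outside docs' interval
])

# Total size of everything under "docs": one aggregate over an interval.
(total,) = con.execute("""
    SELECT COALESCE(SUM(c.size), 0)
    FROM nodes p JOIN nodes c ON c.lft BETWEEN p.lft AND p.rgt
    WHERE p.name = 'docs'
""").fetchone()
print(total)
```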

Joel Brown