
I am working on a desktop application that is much like WinDirStat or voidtools' Everything - it maps hard drives, i.e. creates a deeply nested dictionary out of the directory tree.

The desktop application should then store the directory trees in some kind of database, so that a web application can browse them from the root, one depth level at a time.

Assume both applications run locally on the same machine for the time being.

The question that comes to mind is how the data should be structured and which database should be used, considering that: 1) RAM consumption should be reasonable, and 2) the time it takes for a directory tree to be ready for viewing in the web application should be minimal.

P.S. - My initial approach was serializing each file system node to JSON separately and inserting each into Mongo, with object references linking them to their children. That way the web application could easily load the data based on user demand. However, I am worried that making so many (a million, on average) independent inserts into Mongo will take a lot of time; if I make bulk inserts, that means I have to keep each batch in memory.
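
Roughly what I had in mind for the insert side is sketched below (just a sketch using the Node.js mongodb driver; the database, collection and field names are placeholders I made up). The tree is walked breadth-first and flushed in fixed-size batches, so only one batch has to sit in memory at a time:

const { MongoClient } = require('mongodb');

// Sketch: walk an already-built in-memory tree breadth-first and flush
// nodes to Mongo in fixed-size batches, keeping only one batch in RAM.
async function insertTree(rootNode, batchSize = 1000) {
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const nodes = client.db('scans').collection('nodes'); // placeholder names

    const queue = [rootNode];
    let batch = [];
    while (queue.length > 0) {
        const node = queue.shift();
        const children = node.children || [];
        // each document stores the node's fields plus references to its children
        batch.push({
            _id: node.id,
            name: node.name,
            size: node.size,
            children: children.map(c => c.id)
        });
        queue.push(...children);

        if (batch.length >= batchSize) {
            await nodes.insertMany(batch, { ordered: false });
            batch = [];
        }
    }
    if (batch.length > 0) {
        await nodes.insertMany(batch, { ordered: false });
    }
    await client.close();
}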

I also considered dumping the entire tree as one deeply nested JSON document, but the data is too large for a single Mongo document. GridFS could be used to store it, but then I would have to load the entire tree in the web application even though the deep nodes may not be of interest.

Kfir Eichenblat
  • Also: see this old thread of mine with some great suggestions on how to store trees in relational databases. https://stackoverflow.com/questions/192220/what-is-the-most-efficient-elegant-way-to-parse-a-flat-table-into-a-tree – Tomalak Oct 17 '18 at 17:15
  • Do you need to keep some information for higher nodes (not leaves)? I.e. what's the total size of content in all child nodes? – Boris Schegolev Oct 18 '18 at 08:37
  • Yes, the higher nodes are the more relevant ones. Just think about your Windows Explorer view, starting from C: and going deeper on demand. Not sure what data you're referring to, but each node consists of a few fields (e.g. name, size, attributes) and there are normally more than a million nodes in total – Kfir Eichenblat Oct 18 '18 at 09:07

1 Answer


Given your requirements of:

  • A) Low RAM usage
  • B) Meeting file size limitations in Mongo
  • C) A responsive UI

I'd consider something along the lines of the following.

Take this example directory:

C:\
C:\X\
C:\X\B\
C:\X\file.txt
C:\Y\
C:\Y\file.pdf
C:\Y\R\
C:\Y\R\file.js

In JSON it could possibly be represented as:

{
    "C:": {
        "X": {
            "B": {},
            "file.txt": "file information..."
        },
        "Y": {
            "file.pdf": "file information...",
            "R": {
                "file.js": "file information..."
            }
        }
    }
}

The latter (the nested JSON), as you pointed out, does not scale well with large directory structures (I can tell you first-hand that browsers will not appreciate a JSON blob representing even a modest directory with a few thousand files/folders). The former (the flat path listing), though akin to some actual filesystems and efficient in the right context, is a pain to convert to and from JSON.

My proposal is to break each directory into a separate JSON document, as this addresses all three issues. However, nothing is free: it increases code complexity, the number of requests per session, etc.

The above structure could be broken into the following documents:

[
    {
        "id": "00000000-0000-0000-0000-000000000000",
        "type": "d",
        "name": "C:",
        "children": [
            "11111111-1111-1111-1111-111111111111",
            "22222222-2222-2222-2222-222222222222"
        ]
    },
    {
        "id": "11111111-1111-1111-1111-111111111111",
        "type": "d",
        "name": "X",
        "children": [
            "33333333-3333-3333-3333-333333333333",
            "55555555-5555-5555-5555-555555555555"
        ]
    },
    {
        "id": "22222222-2222-2222-2222-222222222222",
        "type": "d",
        "name": "Y",
        "children": [
            "44444444-4444-4444-4444-444444444444",
            "66666666-6666-6666-6666-666666666666"
        ]
    },
    {
        "id": "33333333-3333-3333-3333-333333333333",
        "type": "d",
        "name": "B",
        "children": []
    },
    {
        "id": "44444444-4444-4444-4444-444444444444",
        "type": "d",
        "name": "R",
        "children": [
            "77777777-7777-7777-7777-777777777777"
        ]
    },
    {
        "id": "55555555-5555-5555-5555-555555555555",
        "type": "f",
        "name": "file.txt",
        "size": "1024"
    },
    {
        "id": "66666666-6666-6666-6666-666666666666",
        "type": "f",
        "name": "file.pdf",
        "size": "2048"
    },
    {
        "id": "77777777-7777-7777-7777-777777777777",
        "type": "f",
        "name": "file.js",
        "size": "2048"
    }
]

Each document represents a directory or a file and, if it is a directory, lists the IDs of its immediate children. Child items can be lazy-loaded using their IDs and appended to their parent in the UI. Well-implemented lazy loading can pre-load child nodes to a desired depth, creating a very responsive UI. RAM usage is minimal, as your server only has to handle small payloads per request. The number of requests does go up considerably versus a single-document approach, but again, some clever lazy loading can cluster requests and reduce the total number.
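
For example, the web application could resolve a directory's immediate children in a single round trip by querying on the stored IDs. A rough sketch using the Node.js mongodb driver (the database and collection names are just placeholders):

const { MongoClient } = require('mongodb');

// Sketch: given a directory document like the ones above, fetch its
// immediate children in one query; deeper levels are only requested
// when the user (or a pre-loading heuristic) asks for them.
async function loadChildren(nodesCollection, directoryDoc) {
    return nodesCollection
        .find({ id: { $in: directoryDoc.children } })
        .toArray();
}

async function example() {
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const nodes = client.db('scans').collection('nodes'); // placeholder names

    const root = await nodes.findOne({ name: 'C:' });
    const firstLevel = await loadChildren(nodes, root);   // the X and Y documents
    console.log(firstLevel.map(n => n.name));

    await client.close();
}

An index on the id field would keep these lookups fast as the number of nodes grows.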

UPDATE 1: I somehow overlooked your second-to-last paragraph before answering, so this is probably more or less what you had in mind. To address the issue of too many documents, some level of clustering of nodes within documents may be in order. I have to head off now, but I'll give it some thought.

UPDATE 2: I've created a gist of a simplified version of the clustering concept I mentioned. It doesn't take files into account, just folders, and it doesn't include any code to update the documents. Hopefully it'll give you some ideas; I'll continue to update it for my own purposes.

Gist: tree_docs_cluster.js

Richard Dunn
  • After a lot of research I realized that this would be the best way to go. My only concern with this approach was the time it would take to insert so many documents one by one. I realized I can speed things up with async insertions, and I will likely buffer insertions so that it's not actually document by document. The tree itself is inserted in hierarchical order (i.e. top levels first) – Kfir Eichenblat Oct 23 '18 at 06:04
  • This problem interests me a lot, for various reasons. So I've been giving some thought to the issue of the number of documents. The best I can think of so far is an AVL tree, where each node contains a _cluster_ of directory nodes and the tree is designed to _loosely_ follow the directory structure, in chunks defined by document size, number of nodes, etc. Using a tree like this, vs. other methods of clustering, has the benefit of ordering your data so that requests return documents containing sibling nodes. I'll update my answer with a clearer description of what I mean when I get a chance. – Richard Dunn Oct 23 '18 at 11:38
  • I appreciate your efforts. I'll definitely compare my current code to your gist for potential improvement. I've written a proof of concept that works fine (it's just the way I described it in my previous comment) – Kfir Eichenblat Oct 26 '18 at 15:20