3

I am looking for a time efficient method to parse a list of files into a tree. There can be hundreds of millions of file paths.

The brute force solution would be to split each path on occurrence of a directory separator, and traverse the tree adding in directory and file entries by doing string comparisons but this would be exceptionally slow.

The input data is usually sorted alphabetically, so the list would be something like:

C:\Users\Aaron\AppData\Amarok\Afile

C:\Users\Aaron\AppData\Amarok\Afile2

C:\Users\Aaron\AppData\Amarok\Afile3

C:\Users\Aaron\AppData\Blender\alibrary.dll

C:\Users\Aaron\AppData\Blender\and_so_on.txt

From this ordering my natural reaction is to partition the directory listings into groups... somehow... before doing the slow string comparisons. I'm really not sure. I would appreciate any ideas.

Edit: It would be better if this tree were lazy loaded from the top down if possible.

Community
  • 1
  • 1
Abignale
  • 195
  • 8
  • Why do you think it would be exceptionally slow? If there are n lines, and each line is up to m characters (so there are <= m directory components), this will only take time O(nm). For each line, insert it into the trie which has depth at most m. nm is the size of the input data, so it's linear. – p00ya Mar 15 '10 at 05:27
  • See, https://stackoverflow.com/questions/45687209/convert-file-path-list-to-tree – YonghanKim Feb 10 '21 at 06:54
  • You can probably find help here, https://stackoverflow.com/questions/45687209/convert-file-path-list-to-tree – YonghanKim Feb 10 '21 at 06:55

3 Answers3

1

You have no choice but to do full string comparisons since you can't guarantee where the strings might differ. There are a couple tricks that might speed things up a little:

  • As David said, form a tree, but search for the new insertion point from the previous one (perhaps with the aid of some sort of matchingPrefix routine that will tell you where the new one differs).
  • Use a hash table for each level of the tree if there may be very many files within and you need to count duplicates. (Otherwise, appending to a stack is fine.)
Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
0

if its possible, you can generate your tree structure with the tree command, here

ghostdog74
  • 327,991
  • 56
  • 259
  • 343
0

To take advantage of the "usually sorted" property of your input data, begin your traversal at the directory where your last file was inserted: compare the directory name of current pathname to the previous one. If they match, you can just insert here, otherwise pop up a level and try again.

David Gelhar
  • 27,873
  • 3
  • 67
  • 84