0

What is the fastest way to get lists of files and search through file lists repeatedly?

Situation:

  1. There can be 5,000 to 35,000 files spread over 3-5 root directories (that have many subdirectories) on a network drive.
  2. There are three file types (tif, sus, img) that a user may or may not search for. Some file types can have 2-3 different file extensions.
  3. From a list of files (in various database tables), I need to find out if each file exists, and if it does, save the path only and filename only to a table.
  4. Searches on file names must be case sensitive but it would be nice to keep original case when saving the path to the table.

Environment:

C# and .NET 4.0 on Windows PC.

Is this the fastest way?:

Is the fastest way to use a dictionary, with FileName as a key (lowercase) and Path as a value (original case)? In this way I can get the index/Path at the same pass when I search for the filename? The FileName and Path are split up front when populating the list.

if (d.TryGetValue("key", out value))
{
    // Log "key" and value to table  // only does one lookup
}

Note: I am a bit concerned that I probably will have duplicate key values per FileType. When/If I run across this scenario what type of list and access method should I use?

Maybe on these rare cases, I should populate another list of the duplicate keys. Because I will need to do at least one of: log/copy/delete of the files in any path.

user610064
  • 481
  • 2
  • 11
  • 25
  • I would probably create an object with "FullFilePath" and "FileName"/"Path" getters that extract those details from the path. Then, create a collection of those objects that you can iterate over. --but that's just me. – Brad Christie Nov 16 '11 at 14:19
  • @BradChristie So... the [`FileInfo`](http://msdn.microsoft.com/en-us/library/system.io.fileinfo.aspx) class, then? – qJake Nov 16 '11 at 14:34
  • @SpikeX: I thought about it, but FileInfo may be to heavy just for meta-data. It depends what you want to performance. I almost think a dumb struct set up in an [easily-searchable collection](http://stackoverflow.com/questions/1009107/what-net-collection-provides-the-fastest-search) – Brad Christie Nov 16 '11 at 14:36
  • @BradChristie But that's all a FileInfo object is, is metadata. It doesn't store the contents of the file at all. They're designed to be very lightweight and quick, and I've processed *hundreds of thousands* of files with the FileInfo class and had no problems whatsoever. -- **Don't reinvent the wheel.** ;) – qJake Nov 16 '11 at 19:14

2 Answers2

1

I would use a Dictionary<string,string> with the FullName (path+file+ext) changed to lower case as key and the FullName unchanged as value. Then split the required parts using the static methods GetDirectoryName and GetFileName of the System.IO.Path class before inserting them into the table.

EDIT: The GetFiles method of the DirectoryInfo class returns an array of FileInfo. FileInfo has a FullName property returning path+file+ext. You could as well store this FileInfo as value in your dictionary if memory consumption is not an issue. FileInfo has a DirectoryName and a Name property returning the two parts you need.

EDIT: Here is my implementation of a multimap which does the Directory<TKey,List<TValue>> stuff:

/// <summary>
/// Represents a collection of keys and values. Multiple values can have the same key.
/// </summary>
/// <typeparam name="TKey">Type of the keys.</typeparam>
/// <typeparam name="TValue">Type of the values.</typeparam>
public class MultiMap<TKey, TValue> : Dictionary<TKey, List<TValue>>
{

    public MultiMap()
        : base()
    {
    }

    public MultiMap(int capacity)
        : base(capacity)
    {
    }

    /// <summary>
    /// Adds an element with the specified key and value into the MultiMap. 
    /// </summary>
    /// <param name="key">The key of the element to add.</param>
    /// <param name="value">The value of the element to add.</param>
    public void Add(TKey key, TValue value)
    {
        List<TValue> valueList;

        if (TryGetValue(key, out valueList)) {
            valueList.Add(value);
        } else {
            valueList = new List<TValue>();
            valueList.Add(value);
            Add(key, valueList);
        }
    }

    /// <summary>
    /// Removes first occurence of a element with a specified key and value.
    /// </summary>
    /// <param name="key">The key of the element to remove.</param>
    /// <param name="value">The value of the element to remove.</param>
    /// <returns>true if the a element is removed; false if the key or the value were not found.</returns>
    public bool Remove(TKey key, TValue value)
    {
        List<TValue> valueList;

        if (TryGetValue(key, out valueList)) {
            if (valueList.Remove(value)) {
                if (valueList.Count == 0) {
                    Remove(key);
                }
                return true;
            }
        }
        return false;
    }

    /// <summary>
    /// Removes all occurences of elements with a specified key and value.
    /// </summary>
    /// <param name="key">The key of the elements to remove.</param>
    /// <param name="value">The value of the elements to remove.</param>
    /// <returns>Number of elements removed.</returns>
    public int RemoveAll(TKey key, TValue value)
    {
        List<TValue> valueList;
        int n = 0;

        if (TryGetValue(key, out valueList)) {
            while (valueList.Remove(value)) {
                n++;
            }
            if (valueList.Count == 0) {
                Remove(key);
            }
        }
        return n;
    }

    /// <summary>
    /// Gets the total number of values contained in the MultiMap.
    /// </summary>
    public int CountAll
    {
        get
        {
            int n = 0;

            foreach (List<TValue> valueList in Values) {
                n += valueList.Count;
            }
            return n;
        }
    }

    /// <summary>
    /// Determines whether the MultiMap contains a element with a specific key / value pair.
    /// </summary>
    /// <param name="key">Key of the element to search for.</param>
    /// <param name="value">Value of the element to search for.</param>
    /// <returns>true if the element was found; otherwise false.</returns>
    public bool Contains(TKey key, TValue value)
    {
        List<TValue> valueList;

        if (TryGetValue(key, out valueList)) {
            return valueList.Contains(value);
        }
        return false;
    }

    /// <summary>
    /// Determines whether the MultiMap contains a element with a specific value.
    /// </summary>
    /// <param name="value">Value of the element to search for.</param>
    /// <returns>true if the element was found; otherwise false.</returns>
    public bool Contains(TValue value)
    {
        foreach (List<TValue> valueList in Values) {
            if (valueList.Contains(value)) {
                return true;
            }
        }
        return false;
    }

}
Olivier Jacot-Descombes
  • 104,806
  • 13
  • 138
  • 188
  • When I am searching for a file I only have the file name from the database table. so shouldn't the key just be the file name? Since performance really matters I would rather match on the whole string. right? – user610064 Nov 16 '11 at 14:53
  • So you know only the filename, but don't know in which directory it is? Then you could use a `Directory>` or a `Directory>` to store multiple file informations for the same file name. – Olivier Jacot-Descombes Nov 16 '11 at 14:57
  • I am guessing (don't know for sure) but since duplicates will be rare I think having 2 lists is faster per FileType. Normal list = Dictonary>. But for every positive match I will have to check the other list. hmmm – user610064 Nov 16 '11 at 15:04
  • Having 2 dicts is complicated and will probably not be significantly faster (if at all). Use my implementation of `MultiMap` it's easy and will be fast enough (see my EDIT). Note that `TryGetValue` returns a `List` not a `TValue`. – Olivier Jacot-Descombes Nov 16 '11 at 15:13
  • Thanks! Will try this today/tomorrow and get back to you. – user610064 Nov 16 '11 at 15:18
0

I would probably use a dictionary with filename lowercased as key. Value would be a class with the needed extra information. I would also search it like your example. If this was slow I would probably also try searching with linq just to see if it was faster. This is however one problem here; this requires that all files through all folders are uniquely named. That might be the case for you, but it could also be a problem if you haven't already considered it ;)

Remember that you can also use a FileWatcher object to keep the memory dictionary/list synchronized with the disk contents if it is subject to change. If it's static I would probably store it all in a database table and search that instead, startup of your program would then be instatanious.

Edit: Just now noticed your conscern for duplicates. If that's a problem I would create a List where fileclass is a class containing needed information on the files. Then search the list using linq as that could give you zero, one or more hits. I think that would be more efficient than a dictionary with a list as value, where the list would contain one or more items (duplicates).

sindre j
  • 4,404
  • 5
  • 31
  • 32
  • How do you feel about adding the overhead of LINQ with regards to speed? (just curious) – Brad Christie Nov 16 '11 at 14:49
  • The files are indeed static - FileWatcher will not be needed. – user610064 Nov 16 '11 at 14:54
  • I have considered duplicate keys and think the fastest solution may be another dictionary for duplicates List. My database I am using is MS Access 2007 so I think a dictionary list is much faster than searching this Db table. Do I really need a class? I just need to store FileName, Path. Do you think the List idea would be faster than 2 dictionaries for my simple purposes? – user610064 Nov 16 '11 at 15:00