
I have the following code. Is there a more efficient way to accomplish the same tasks?

  1. Given a folder, loop over the files within the folder.
  2. Within each file, skip the first four header lines.
  3. After splitting the row based on a space, if the resulting array contains fewer than 7 elements, skip it.
  4. Check if the specified element is already in the dictionary. If it is, increment the count. If not, add it.

It's not a complicated process. Is there a better way to do this? LINQ?

string sourceDirectory = @"d:\TESTDATA\";

string[] files = Directory.GetFiles(sourceDirectory, "*.log", 
    SearchOption.TopDirectoryOnly);

var dictionary = new Dictionary<string, int>();

foreach (var file in files)
{
    string[] lines = System.IO.File.ReadLines(file).Skip(4).ToArray();

    foreach (var line in lines)
    {
        var elements = line.Split(' ');

        if (elements.Length > 6)
        {
            if (dictionary.ContainsKey(elements[9]))
            {
                dictionary[elements[9]]++;
            }
            else
            {
                dictionary.Add(elements[9], 1);
            } 
        }
    }
}
Alexei Levenkov
DenaliHardtail
  • Probably belongs on [CodeReview.SE](http://codereview.stackexchange.com/) – Corey Ogburn Apr 26 '13 at 18:03
    Sounds like you're using these files as a kind of data source. Depending on their size, you may do well using the ADO text driver. It lets you query flat files almost like SQL tables. It can be a bit tricky to get it working, but once you have it configured right, it's really fast and easy to use. – mroach Apr 26 '13 at 18:05
  • You do `if (elements.Length > 6)` and go on to access `elements[9]` if it is true. So what will happen when the length is 7, 8 or 9? – Matthew Watson Apr 26 '13 at 18:22

4 Answers


Something Linq-y should do you. I doubt it's any more efficient, and it's almost certainly more of a hassle to debug, but it is very trendy these days:

static Dictionary<string, int> Slurp(string rootDirectory)
{
    Dictionary<string, int> instance =
        Directory.EnumerateFiles(rootDirectory, "*.log", SearchOption.TopDirectoryOnly)
                 .SelectMany(fn => File.ReadAllLines(fn)
                                       .Skip(4)
                                       .Select(txt => txt.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
                                       .Where(x => x.Length > 9)
                                       .Select(x => x[9]))
                 .GroupBy(x => x)
                 .ToDictionary(x => x.Key, x => x.Count());
    return instance;
}
Nicholas Carey

A more efficient (performance-wise) way to do this would be to parallelize your outer foreach with the Parallel.ForEach method. You'd also need a ConcurrentDictionary instead of a standard Dictionary.
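A minimal sketch of that combination, carrying over the directory layout and the index-9 field from the question's code (both are assumptions from the original post, not part of this answer):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ParallelCount
{
    public static ConcurrentDictionary<string, int> Count(string sourceDirectory)
    {
        var counts = new ConcurrentDictionary<string, int>();
        var files = Directory.GetFiles(sourceDirectory, "*.log", SearchOption.TopDirectoryOnly);

        // Each file is processed on a worker thread from the pool.
        Parallel.ForEach(files, file =>
        {
            foreach (var line in File.ReadLines(file).Skip(4))
            {
                var elements = line.Split(' ');
                if (elements.Length > 9)   // guard matches the index actually used
                {
                    // AddOrUpdate is atomic per key, so no explicit locking is needed
                    counts.AddOrUpdate(elements[9], 1, (_, n) => n + 1);
                }
            }
        });
        return counts;
    }
}
```

Note that AddOrUpdate takes care of the "insert or increment" race that would corrupt a plain Dictionary under concurrent writers.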

Tim P.

Not sure if you are looking for better performance or more elegant code. If you prefer functional-style LINQ, maybe something like this:

var query = from element in
                (
                    // go through all file names
                    from fileName in files
                    // read all lines from every file and skip the first 4
                    from line in File.ReadAllLines(fileName).Skip(4)
                    // split every line into words
                    let lineData = line.Split(new[] { ' ' })
                    // select only lines with more than 6 words
                    where lineData.Length > 6
                    // take the element at index 6 from the line
                    select lineData[6]
                )
            // outer query groups by element
            group element by element into g
            select new
            {
                Key = g.Key,
                Count = g.Count()
            };

var dictionary = query.ToDictionary(e => e.Key, e => e.Count);

The result is a dictionary with the word as Key and the count of word occurrences as Value.

Jurica Smircic

I would expect that reading the files is going to be the most time-consuming part of the operation. In many cases, trying to read multiple files at once on different threads will hurt rather than help performance, but it may be helpful to have a thread which does nothing but read files, so that it can keep the drive as busy as possible.

If the files could get big (as seems likely) and if no line will exceed 32K bytes (8,000-32,000 Unicode characters), I'd suggest that you read them in chunks of about 32K or 64K bytes (not characters). Reading a file as bytes and subdividing into lines yourself may be faster than reading it as lines, since the subdivision can happen on a different thread from the physical disk access.

I'd suggest starting with one thread for disk access and one thread for parsing and counting, with a blocking queue between them. The disk-access thread should put on the queue data items which contain a 32K array of bytes, an indication of how many bytes are valid [may be less than 32K at the end of a file], and an indicator of whether it's the last record of a file. The parsing thread should read those items, parse them into lines, and update the appropriate counts.
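A rough sketch of that two-thread pipeline, using BlockingCollection<T> as the blocking queue. The PipelineCount and Chunk names, the 32K chunk size, and the ASCII assumption (a multi-byte character straddling a chunk boundary would break this) are mine, not from the answer:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;

class PipelineCount
{
    const int ChunkSize = 32 * 1024;
    struct Chunk { public byte[] Data; public int Length; public bool LastOfFile; }

    public static Dictionary<string, int> Count(IEnumerable<string> files)
    {
        var queue = new BlockingCollection<Chunk>(boundedCapacity: 8);
        var counts = new Dictionary<string, int>();

        // Disk-access thread: does nothing but read raw byte chunks.
        var reader = new Thread(() =>
        {
            foreach (var file in files)
            {
                using var fs = File.OpenRead(file);
                int n;
                var buf = new byte[ChunkSize];
                while ((n = fs.Read(buf, 0, buf.Length)) > 0)
                {
                    queue.Add(new Chunk { Data = buf, Length = n, LastOfFile = false });
                    buf = new byte[ChunkSize];   // fresh buffer; the parser owns the old one
                }
                queue.Add(new Chunk { Data = Array.Empty<byte>(), Length = 0, LastOfFile = true });
            }
            queue.CompleteAdding();
        });
        reader.Start();

        // Parsing thread (here: the caller): subdivides chunks into lines and counts.
        int lineIndex = 0;
        var carry = "";   // partial line left over from the previous chunk
        void ProcessLine(string line)
        {
            if (lineIndex++ < 4) return;          // skip the four header lines
            var parts = line.Split(' ');
            if (parts.Length > 9)
                counts[parts[9]] = counts.TryGetValue(parts[9], out var c) ? c + 1 : 1;
        }

        foreach (var chunk in queue.GetConsumingEnumerable())
        {
            if (chunk.LastOfFile)
            {
                if (carry.Length > 0) ProcessLine(carry);
                carry = ""; lineIndex = 0;        // reset per file
                continue;
            }
            // ASCII assumption: safe to decode each chunk independently
            var text = carry + Encoding.ASCII.GetString(chunk.Data, 0, chunk.Length);
            var lines = text.Split('\n');
            for (int i = 0; i < lines.Length - 1; i++) ProcessLine(lines[i].TrimEnd('\r'));
            carry = lines[lines.Length - 1];      // may span into the next chunk
        }
        reader.Join();
        return counts;
    }
}
```

The bounded capacity keeps the reader from racing far ahead of the parser, and the end-of-file marker is what lets the parser flush a trailing partial line and reset its header-skip counter.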

To improve counting performance, it may be helpful to define

class ExposedFieldHolder<T> { public T Value; }

and then have a Dictionary<string, ExposedFieldHolder<int>>. One would have to create a new ExposedFieldHolder<int> for each dictionary slot, but dictionary[elements[9]].Value++; would likely be faster than dictionary[elements[9]]++;, since the latter statement translates as dictionary[elements[9]] = dictionary[elements[9]] + 1; and has to look up the element once when reading and again when writing.
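A short sketch of the counting loop built around that holder class (the HolderCount name and the lines parameter are illustrative assumptions):

```csharp
using System.Collections.Generic;
using System.Linq;

class ExposedFieldHolder<T> { public T Value; }

static class HolderCount
{
    public static Dictionary<string, ExposedFieldHolder<int>> Count(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, ExposedFieldHolder<int>>();
        foreach (var line in lines.Skip(4))
        {
            var elements = line.Split(' ');
            if (elements.Length <= 9) continue;
            if (!counts.TryGetValue(elements[9], out var holder))
                counts.Add(elements[9], holder = new ExposedFieldHolder<int>());
            holder.Value++;   // increments in place: one hash lookup, not a read plus a write
        }
        return counts;
    }
}
```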

If it's necessary to do the parsing and counting on multiple threads, I would suggest that each thread have its own queue, and the disk-reading thread switch queues after each file [all blocks of a file should be handled by the same thread, since a text line might span two blocks]. Additionally, while it would be possible to use a ConcurrentDictionary, it might be more efficient to have each thread have its own independent Dictionary and consolidate the results at the end.
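The final consolidation step might look something like this minimal sketch (the Merge name is hypothetical):

```csharp
using System.Collections.Generic;

static class MergeCounts
{
    // Combine each thread's private Dictionary into one total, summing counts per key.
    public static Dictionary<string, int> Merge(IEnumerable<Dictionary<string, int>> perThread)
    {
        var total = new Dictionary<string, int>();
        foreach (var d in perThread)
            foreach (var kv in d)
                total[kv.Key] = total.TryGetValue(kv.Key, out var n) ? n + kv.Value : kv.Value;
        return total;
    }
}
```

Since each thread only ever touches its own dictionary, no synchronization is needed until this single-threaded merge at the end.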

supercat