14

I'm trying to read a large text file and output the distinct words in it along with their counts. I've tried a couple of attempts so far, and this is by far the fastest solution I have come up with.

private static readonly char[] separators = { ' ' };

public IDictionary<string, int> Parse(string path)
{
    var wordCount = new Dictionary<string, int>();

    using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var streamReader = new StreamReader(fileStream))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (wordCount.ContainsKey(word))
                {
                    wordCount[word] = wordCount[word] + 1;
                }
                else
                {
                    wordCount.Add(word, 1);
                }
            }
        }
    }

    return wordCount;
}

How I am measuring my solution

I have a 200MB text file whose total word count I know (via a text editor). I'm using the Stopwatch class to measure the time taken, and I sum the counts to check accuracy. So far, it is taking around 9 seconds.
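
For context, this is roughly the harness I'm using (a simplified sketch; the Parser class name is just illustrative for whatever class holds Parse):

var stopwatch = System.Diagnostics.Stopwatch.StartNew();

IDictionary<string, int> counts = new Parser().Parse(path);

stopwatch.Stop();

// Sanity check: the sum of the per-word counts should match the total
// word count my text editor reports for the file.
long totalWords = 0;
foreach (var count in counts.Values)
{
    totalWords += count;
}

Console.WriteLine("Distinct words: " + counts.Count);
Console.WriteLine("Total words: " + totalWords);
Console.WriteLine("Elapsed: " + stopwatch.Elapsed.TotalSeconds + "s");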

Other attempts

  • I have tried to utilise multithreading to split out the work via the TPL library. This involved batching multiple lines, sending the processing of each batch of lines to a separate task and locking the read/write operations on the dictionary (a simplified sketch of this attempt is shown after this list). This, however, did not give me any performance improvement.
  • It took around 30 seconds. I suspect the cost of locking the dictionary for every read/write outweighs any gain from the parallelism.
  • I also had a look at the ConcurrentDictionary type, but from my understanding the AddOrUpdate method requires the calling code to handle the synchronization itself, and it brought no performance benefit.
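
For reference, this is roughly the shape of the locked-dictionary attempt (a simplified sketch from memory, not my exact code; the dispatch of batches via Task.Run is omitted):

private static readonly object wordCountLock = new object();

// Each batch of lines was handed off to a separate task that ran this.
private void ProcessBatch(List<string> batch, Dictionary<string, int> wordCount)
{
    foreach (var line in batch)
    {
        var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            // Every single word takes the same lock, which effectively
            // serialises the tasks again.
            lock (wordCountLock)
            {
                if (wordCount.ContainsKey(word))
                {
                    wordCount[word] = wordCount[word] + 1;
                }
                else
                {
                    wordCount.Add(word, 1);
                }
            }
        }
    }
}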

I am sure there is a faster way to achieve this! Is there a better data structure to use for this problem?

Any suggestions/criticisms to my solution are welcome - trying to learn and improve here!

Cheers.

UPDATE: Here is the link to the test file I'm using.

Prabu
  • 1
    What is your source document? 200MB of text is probably on the order of an entire encyclopedia! – Jason Watkins May 16 '15 at 20:01
  • 2
    _batching multiple lines_ Maybe it would be better to partition the whole file into the number of cores (n) and use n dictionaries without locking them. Then consolidating them into one could be a lot faster, especially with many repeated words – TaW May 16 '15 at 20:02
  • 3
    This looks like it could be solved effectively with the map-reduce paradigm. Here is an answer which explains how to apply map-reduce to pretty much the same thing you are asking: http://stackoverflow.com/questions/12375761/good-map-reduce-examples-for-explanation – John May 16 '15 at 20:02
  • 1
    It is exactly as I would have written it, line by line :-) Wait... I would have used a `TryGetValue` instead of the `ContainsKey` to have one less access to the dictionary. – xanatos May 16 '15 at 20:02
  • @JasonWatkins It's just a test document I made up myself. It's a large article that I've repeated quite a few times. – Prabu May 16 '15 at 20:04
  • 1
    @pwee167 Just out of curiosity, have you benchmarked the time the program would need to simply read the 200mb file? Because 200mb/9sec = 21mb/sec... It is already good – xanatos May 16 '15 at 20:06
  • 1
    Maybe try reading each line one by one and adding to a list/array, then use some form of parallelism (TPL's foreach perhaps) to perform the processing on each line in said list/array. You could try the concurrent dictionary for thread safe access https://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx – User92 May 16 '15 at 20:06
  • @xanatos No, I haven't benchmarked that. Is that because you are thinking read it all into memory then process? I'm also approaching this with the assumption that the file might be significantly larger, say in the order of gigabytes. – Prabu May 16 '15 at 20:08
  • 1
    @pwee167 It would be interesting to comprehend if your program is disk limited or processor limited. If it is disk limited, then nothing you'll do will change it. If it is processor limited, there is a little chance to multithread it (but I don't think it is really possible to obtain something "big") – xanatos May 16 '15 at 20:09
  • If it's truly CPU bound, multi-threading could very easily yield something "big" - two threads could cut run time in half, assuming the extra load on the IO system doesn't become the limiting factor. – glenebob May 16 '15 at 20:38
  • Is the file available somewhere so that one could try creating a better word-count implementation that would work for your file? ie. do you have it on dropbox or similar (assuming it can be shared that is) – Lasse V. Karlsen May 16 '15 at 20:53
  • @LasseV.Karlsen I'm uploading it to a dropbox location now. I'll update the post once it's done. – Prabu May 16 '15 at 20:57
  • @LasseV.Karlsen Question updated with link to test data file. – Prabu May 16 '15 at 21:10
    You should measure your program in a more fine-grained way; here's how the time is being spent in it: ~25% in string.Split, ~23.5% in dictionary lookup, ~19% in dictionary insert, ~17.5% in dictionary retrieval, and ~12.5% in reading the file: http://i.stack.imgur.com/d0Fv2.png – Lasse V. Karlsen May 16 '15 at 21:37
  • I recommend migrating this question to [Code Review](http://codereview.stackexchange.com) though as it sounds to me like that would be a more appropriate site for this type of question. – Lasse V. Karlsen May 16 '15 at 21:39
  • @LasseV.Karlsen Thanks for that. What tool did you use for this? I have placed it in Code Review. – Prabu May 16 '15 at 21:42
  • I use [JetBrains dotTrace](https://www.jetbrains.com/profiler/) but there are many such tools (don't ask for a list, just use google). – Lasse V. Karlsen May 16 '15 at 21:43

6 Answers

12

The best short answer I can give is to measure, measure, measure. Stopwatch is nice to get a feeling for where time is spent, but eventually you'll end up sprinkling large swaths of your code with it, or you will have to find a better tool for this purpose. I would suggest getting a dedicated profiler tool; there are many available for C# and .NET.


I've managed to shave off about 43% of the total runtime in three steps.

First I measured your code and got this:

Original code measurements

This seems to indicate that there are two hotspots here that we can try to combat:

  1. String splitting (SplitInternal)
  2. Dictionary maintenance (FindEntry, Insert, get_Item)

The last piece of the time is spent reading the file, and I really doubt we can gain much by changing that part of the code. One other answer here mentions using specific buffer sizes; I tried this and could not measure any difference.

The first, string splitting, is somewhat easy but involves rewriting a very simple call to string.Split into a bit more code. I rewrote the loop that processes one line to this:

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                // process word here
            }
            lastPos = index + 1;
        }
    }
}

I then rewrote the processing of one word to this:

int currentCount;
wordCount.TryGetValue(word, out currentCount);
wordCount[word] = currentCount + 1;

This relies on the fact that:

  1. TryGetValue is cheaper than checking if the word exists and then retrieving its current count
  2. If TryGetValue fails to get the value (key does not exist) then it will initialize the currentCount variable here to its default value, which is 0. This means that we don't really need to check if the word actually existed.
  3. We can add new words to the dictionary through the indexer (it will either overwrite the existing value or add a new key+value to the dictionary)

The final loop thus looks like this:

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                int currentCount;
                wordCount.TryGetValue(word, out currentCount);
                wordCount[word] = currentCount + 1;
            }
            lastPos = index + 1;
        }
    }
}

The new measurement shows this:

new measurement

Details:

  1. We went from 6876ms to 5013ms
  2. We lost the time spent in SplitInternal, FindEntry and get_Item
  3. We gained time spent in TryGetValue and Substring

Here are the difference details:

difference

As you can see, we lost more time than we gained, resulting in a net improvement.

However, we can do better. We're doing two dictionary lookups here, each of which involves calculating the hash code of the word and comparing it to keys in the dictionary. The first lookup is part of the TryGetValue and the second is part of wordCount[word] = ....

We can remove the second dictionary lookup by creating a smarter data structure inside the dictionary at the cost of more heap memory used.

We can use Xanatos' trick of storing the count inside an object so that we can remove that second dictionary lookup:

public class WordCount
{
    public int Count;
}

...

var wordCount = new Dictionary<string, WordCount>();

...

string word = line.Substring(lastPos, index - lastPos);
WordCount currentCount;
if (!wordCount.TryGetValue(word, out currentCount))
    wordCount[word] = currentCount = new WordCount();
currentCount.Count++;

This will only retrieve the count from the dictionary; adding 1 extra occurrence does not involve the dictionary at all. The result of the method also changes to return this WordCount type as part of the dictionary instead of just an int.

Net result: ~43% savings.

final results

Final piece of code:

public class WordCount
{
    public int Count;
}

public static IDictionary<string, WordCount> Parse(string path)
{
    var wordCount = new Dictionary<string, WordCount>();

    using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
    using (var streamReader = new StreamReader(fileStream, Encoding.Default, false, 65536))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            int lastPos = 0;
            for (int index = 0; index <= line.Length; index++)
            {
                if (index == line.Length || line[index] == ' ')
                {
                    if (lastPos < index)
                    {
                        string word = line.Substring(lastPos, index - lastPos);
                        WordCount currentCount;
                        if (!wordCount.TryGetValue(word, out currentCount))
                            wordCount[word] = currentCount = new WordCount();
                        currentCount.Count++;
                    }
                    lastPos = index + 1;
                }
            }
        }
    }

    return wordCount;
}
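
If a caller still needs the original IDictionary<string, int> shape, the result can be flattened in a single pass afterwards, for example (just an illustrative sketch):

IDictionary<string, WordCount> counts = Parse(path);

// Flatten the WordCount objects back into plain int counts.
var flattened = new Dictionary<string, int>(counts.Count);
foreach (var pair in counts)
{
    flattened.Add(pair.Key, pair.Value.Count);
}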
Lasse V. Karlsen
  • I implemented pretty much the same thing as this answer. The original code took about 30 seconds for me (my laptop is an old pos). With the optimizations here, minus the buffer size, I got it down to about 19 seconds. Changing the buffer sizes to 64K saved about an additional 2 seconds. – glenebob May 16 '15 at 22:47
  • Can you explain to me why you chose 65536 as the buffer size? – Prabu May 17 '15 at 10:52
  • No, I just picked a number that is a reasonable power of 2. – Lasse V. Karlsen May 17 '15 at 10:53
6

Your approach seems to be in line with how most people would tackle it. You're right to notice that using multi-threading did not offer any significant gains, because the bottleneck is most likely IO, and no matter what kind of hardware you have, you cannot read faster than your hardware supports.

If you're really looking for speed improvements (I doubt you will get any), you could try to implement a producer-consumer pattern, where one thread reads the file and other threads process the lines (maybe then parallelising the checking of words in a line). The trade-off here is that you're adding considerably more complex code in exchange for marginal improvements (only benchmarking can determine this).

http://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem
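
A minimal sketch of what that could look like with BlockingCollection<string> (illustrative only; the worker count, the bounded capacity and the merge of per-consumer dictionaries at the end are just one possible design):

// Requires System.Collections.Concurrent, System.Collections.Generic,
// System.IO and System.Threading.Tasks.
public static Dictionary<string, int> ParseProducerConsumer(string path, int workerCount)
{
    var lines = new BlockingCollection<string>(boundedCapacity: 1000);

    // Producer: the only thread touching the disk.
    var producer = Task.Run(() =>
    {
        foreach (var line in File.ReadLines(path))
        {
            lines.Add(line);
        }
        lines.CompleteAdding();
    });

    // Consumers: each builds a private dictionary, so no locking is needed.
    var consumers = new Task<Dictionary<string, int>>[workerCount];
    for (int i = 0; i < workerCount; i++)
    {
        consumers[i] = Task.Run(() =>
        {
            var local = new Dictionary<string, int>();
            foreach (var line in lines.GetConsumingEnumerable())
            {
                var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                foreach (var word in words)
                {
                    int count;
                    local.TryGetValue(word, out count);
                    local[word] = count + 1;
                }
            }
            return local;
        });
    }

    Task.WaitAll(consumers);
    producer.Wait();

    // Merge the per-consumer dictionaries into the final result.
    var merged = new Dictionary<string, int>();
    foreach (var task in consumers)
    {
        foreach (var pair in task.Result)
        {
            int count;
            merged.TryGetValue(pair.Key, out count);
            merged[pair.Key] = count + pair.Value;
        }
    }

    return merged;
}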

edit: also have a look at ConcurrentDictionary

BlackBox
  • I had a look at ConcurrentDictionary, which I thought might be useful, but from my research today I believe the `AddOrUpdate` method does not guarantee against dirty reads/writes. – Prabu May 16 '15 at 20:17
6

I've gained quite a lot (from 25 sec to 20 sec on a 200 MB file) simply by changing to:

int cnt;

if (wordCount.TryGetValue(word, out cnt))
{
    wordCount[word] = cnt + 1;
}
else
{
    wordCount.Add(word, 1);
}

A variant based on ConcurrentDictionary<> and Parallel.ForEach (using the IEnumerable<> overload). Note that instead of using an int, I'm using an InterlockedInt that uses Interlocked.Increment to increment itself. Being a reference type, it works correctly with the ConcurrentDictionary<>.GetOrAdd...

public class InterlockedInt
{
    private int cnt;

    public int Cnt
    {
        get
        {
            return cnt;
        }
    }

    public void Increment()
    {
        Interlocked.Increment(ref cnt);
    }
}

public static IDictionary<string, InterlockedInt> Parse(string path)
{
    var wordCount = new ConcurrentDictionary<string, InterlockedInt>();

    Action<string> action = line2 =>
    {
        var words = line2.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            wordCount.GetOrAdd(word, x => new InterlockedInt()).Increment();
        }
    };

    IEnumerable<string> lines = File.ReadLines(path);
    Parallel.ForEach(lines, action);

    return wordCount;
}

Note that using Parallel.ForEach is less efficient than directly using one thread for each physical core (you can see how in the edit history). While both solutions take less than 10 seconds of wall-clock time on my PC, the Parallel.ForEach uses 55 seconds of CPU time against the 33 seconds of the thread solution.

There is another trick that is worth around 5-10%:

public static IEnumerable<T[]> ToBlock<T>(IEnumerable<T> source, int num)
{
    var array = new T[num];
    int cnt = 0;

    foreach (T row in source)
    {
        array[cnt] = row;
        cnt++;

        if (cnt == num)
        {
            yield return array;
            array = new T[num];
            cnt = 0;
        }
    }

    if (cnt != 0)
    {
        Array.Resize(ref array, cnt);
        yield return array;
    }
}

You "group" the rows (choose a number between 10 and 100) in packets, so that there is less intra-thread communication. The workers then have to do a foreach on the received rows.

xanatos
2

Using a 200 MB text file, the following took a little over 5 seconds on my machine.

class Program
{
    private static readonly char[] separators = { ' ' };
    private static List<string> lines;
    private static ConcurrentDictionary<string, int> freqeuncyDictionary;

    static void Main(string[] args)
    {
        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        string path = @"C:\Users\James\Desktop\New Text Document.txt";
        lines = ReadLines(path);
        ConcurrentDictionary<string, int> test = GetFrequencyFromLines(lines);

        stopwatch.Stop();
        Console.WriteLine(@"Complete after: " + stopwatch.Elapsed.TotalSeconds);
    }

    private static List<string> ReadLines(string path)
    {
        lines = new List<string>();
        using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
        {
            using (var streamReader = new StreamReader(fileStream))
            {
                string line;
                while ((line = streamReader.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
        }
        return lines;            
    }

    public static ConcurrentDictionary<string, int> GetFrequencyFromLines(List<string> lines)
    {
        freqeuncyDictionary = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(lines, line =>
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                // AddOrUpdate is atomic per key; a separate ContainsKey check
                // followed by a plain indexer write is not thread-safe and
                // could drop counts under contention.
                freqeuncyDictionary.AddOrUpdate(word, 1, (key, oldValue) => oldValue + 1);
            }
        });

        return freqeuncyDictionary;
    }
}
User92
  • I'm trying to avoid reading the whole file into memory, because the file can be much bigger in other situations. – Prabu May 16 '15 at 21:08
  • Okay. In your question you mentioned that you tried reading the lines in batches and then using threading, but it ended up slower due to (you think) locking the dictionary. Did you try the concurrent dictionary in this scenario? Also, using the file you linked in the Q, it took around 7 seconds, which dropped to 6 when using TryGetValue – User92 May 16 '15 at 21:24
  • I had a look at the ConcurrentDictionary type, which I thought might be useful initially, but from my research today I believe the AddOrUpdate method does not guarantee against dirty reads/writes. This means I need to do my own synchronization around the incrementing of the counts in the dictionary, which doesn't give me any performance gains. Hence, I went back to the standard dictionary type. – Prabu May 16 '15 at 21:28
1

If you're trying to count a specific word, you can use the strtok function linked here and compare each word with the word you are evaluating. I don't think this is very costly, but I have never tried it with a big file...

Custodius
  • 1
    No, that is not what he wants. He wants to create a list of __all words__ with their counts. Does strtok do anything other than split? – TaW May 16 '15 at 20:07
  • OK, I misunderstood the question. strtok splits and afterwards you compare, but for multiple words it is more complex. Sorry for the misunderstanding and thanks for the correction. – Custodius May 16 '15 at 20:39
1

I recommend setting your stream buffer sizes larger, and matching:

    using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, false, 8192))

First of all, your code results in buffers too small for this kind of work. Second, since the reader's buffer is smaller than the stream's buffer, the data is copied first to the stream's buffer, then to the reader's buffer. This can be a performance destroyer for the type of work you're doing.

When the buffer sizes match, the stream's buffer will never be used - in fact it will never be allocated.

glenebob