1

I have a text file stored as a string variable. The text file is processed so that it only contains lowercase words and spaces. Now, say I have a static dictionary, which is just a list of specific words, and I want to count, from within the text file, the frequency of each word in the dictionary. For example:

Text file:

i love love vb development although i m a total newbie

Dictionary:

love, development, fire, stone

The output I'd like to see is something like the following, listing both the dictionary word and its count. If it makes coding simpler, it can also only list the dictionary word that appeared in the text.

===========

WORD, COUNT

love, 2

development, 1

fire, 0

stone, 0

============

Using a regex (e.g. "\w+") I can get all the word matches, but I have no clue how to count only the matches that are also in the dictionary, so I'm stuck. Efficiency is crucial here since the dictionary is quite large (~100,000 words) and the text files are not small either (~200 KB each).

I'd appreciate any kind of help.

johnv
  • Maybe something like splitting the string into an `Array` or a `List` and then iterating/processing the list? – Uwe Keim Dec 23 '10 at 17:08
  • You've tagged this as both c# and vb.net. Which is it? – Rob Stevenson-Leggett Dec 23 '10 at 17:10
  • FWIW, using a regex here to match the words is not a good idea, especially since you indicate in the question that the input is clean (lower case letters and spaces only.) Use String.Split instead. Aside from that this really is a trivial problem. Look up Dictionary in the .NET documentation. – Mike Jones Dec 23 '10 at 17:19
  • @pcantin: Do they use 100,000 word dictionaries in homework these days? Granted, college was 30 years ago for me, but that still seems awfully large and detailed for homework...? – RBarryYoung Dec 23 '10 at 17:20
  • @RBarryYoung since you can easily download a complete dictionary from Project Gutenberg, there's no real reason NOT to use it. – 3Dave Dec 23 '10 at 17:30
  • @David Lively: Cool! Got a pointer to that dictionary? :-) – RBarryYoung Dec 23 '10 at 17:47
  • @pcantin: Wow seriously? That was completely uncalled for. How can YOU be so sure this is homework? :| – dnclem Dec 25 '11 at 15:49

4 Answers

6

You can count the words in the string by grouping them and turning it into a dictionary:

Dictionary<string, int> count =
  theString.Split(' ')
  .GroupBy(s => s)
  .ToDictionary(g => g.Key, g => g.Count());

Now you can just check if the words exist in the dictionary, and show the count if it does.
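
The lookup step might look like the following minimal sketch. It assumes C# 9 or later (top-level statements). `theString` matches the snippet above, while `dictionaryWords` and the `Lookup` helper are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data from the question; dictionaryWords is a hypothetical name.
string theString = "i love love vb development although i m a total newbie";
string[] dictionaryWords = { "love", "development", "fire", "stone" };

// Count every word in the text once, as in the snippet above.
// RemoveEmptyEntries guards against doubled spaces producing "" entries.
Dictionary<string, int> count = theString
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .GroupBy(s => s)
    .ToDictionary(g => g.Key, g => g.Count());

// Hypothetical helper: look up one dictionary word, returning 0 when absent.
int Lookup(string word) => count.TryGetValue(word, out int n) ? n : 0;

Console.WriteLine("WORD, COUNT");
foreach (string word in dictionaryWords)
    Console.WriteLine($"{word}, {Lookup(word)}");
```

Because the text is grouped only once, each dictionary word costs a single hash lookup, so the loop over a large dictionary stays cheap.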

Guffa
5
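var dict = new Dictionary<string, int>();

// Assumes "file" is the text held in a string, so split it into words first;
// iterating the string directly would enumerate characters, not words.
foreach (var word in file.Split(' '))
  if (dict.ContainsKey(word))
    dict[word]++;
  else
    dict[word] = 1;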
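
If only the dictionary words matter, one variation on the loop above is to seed the counts with every dictionary word at zero and increment only the words that are already keys. This is a sketch, assuming C# 9 or later; `text`, `dictionaryWords` and `CountDictionaryWords` are hypothetical names and not part of the answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: counts only the words that appear in dictionaryWords.
static Dictionary<string, int> CountDictionaryWords(string text, IEnumerable<string> dictionaryWords)
{
    // Seed every dictionary word with zero so absent words are still reported.
    var counts = new Dictionary<string, int>();
    foreach (string word in dictionaryWords)
        counts[word] = 0;

    // One pass over the text; each lookup is a hash probe, not a list scan.
    foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        if (counts.TryGetValue(word, out int n))
            counts[word] = n + 1;

    return counts;
}

var result = CountDictionaryWords(
    "i love love vb development although i m a total newbie",
    new[] { "love", "development", "fire", "stone" });

foreach (var pair in result)
    Console.WriteLine($"{pair.Key}, {pair.Value}");
```

Words that are not in the dictionary are skipped without being stored, so memory use depends on the dictionary size and not on the vocabulary of the text.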
fejesjoco
0

Using Groovy's regex facility, I would do it as below:

def input="""
    i love love vb development although i m a total newbie
"""

def dictionary=["love", "development", "fire", "stone"]


dictionary.each {
    // Word boundaries stop "love" from also matching inside e.g. "glove".
    def pattern = ~/\b${it}\b/
    def match = input =~ pattern
    println "${it}-${match.count}"
}
3Dave
Rishi
0

Try this. The words variable is obviously your string of text. The keywords array is a list of keywords you want to count.

This won't return a 0 for dictionary words that aren't in the text, but you specified that this behavior is okay. This should give you relatively good performance while meeting the requirements of your application.

string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };

Regex regex = new Regex("\\w+");

var frequencyList = regex.Matches(words)
    .Cast<Match>()
    .Select(c => c.Value.ToLowerInvariant())
    .Where(c => keywords.Contains(c))
    .GroupBy(c => c)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(g => g.Count)
    .ThenBy(g => g.Word);

//Convert to a dictionary
Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);

//Or iterate through them as is
foreach (var item in frequencyList)
    Response.Write(String.Format("{0}, {1}", item.Word, item.Count));

Since you indicated that everything is lower case and separated by spaces, you can achieve the same thing without a regex by modifying the above code like so:

string words = "i love love vb development although i m a total newbie";
string[] keywords = new[] { "love", "development", "fire", "stone" };

var frequencyList = words.Split(' ')
    .Where(c => keywords.Contains(c))
    .GroupBy(c => c)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(g => g.Count)
    .ThenBy(g => g.Word);

Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);
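
With a dictionary of around 100,000 words, `keywords.Contains(c)` on an array scans the array for every word in the text. One possible adjustment, sketched here under the assumption of C# 9 or later, is to load the keywords into a `HashSet<string>` first. `keywordSet` is a name introduced for this example only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string words = "i love love vb development although i m a total newbie";
string[] keywords = { "love", "development", "fire", "stone" };

// Build the set once; Contains is then a hash lookup instead of an array scan.
var keywordSet = new HashSet<string>(keywords);

var frequencyList = words.Split(' ')
    .Where(c => keywordSet.Contains(c))
    .GroupBy(c => c)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(g => g.Count)
    .ThenBy(g => g.Word)
    .ToList();

foreach (var item in frequencyList)
    Console.WriteLine($"{item.Word}, {item.Count}");
```

The rest of the pipeline is unchanged, so the results are the same as before; only the cost of the membership test differs.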
Scott