I need to calculate how many times each keyword is reoccurring in a string, with sorting by highest number. What's the fastest algorithm available in .NET code for this purpose?
-
What language? I am sure there's no built in framework function to do exactly this, and the specifics of how you define "keyword" could complicate it, e.g. plurals, punctuation, and so on. This is an interesting algorithmic problem but the answer will depend on the programming language you use. – Jamie Treworgy Oct 27 '10 at 16:47
-
Both C# and VB.NET are acceptable for me. And currently ability to exclude unnecessary parts is not needed, all words are fine. – SharpAffair Oct 27 '10 at 16:51
4 Answers
EDIT: code below groups unique tokens with count
string[] target = src.Split(new char[] { ' ' });
var results = target.GroupBy(t => new
{
str = t,
count = target.Count(sub => sub.Equals(t))
});
This is finally starting to make more sense to me...
EDIT: code below results in count correlated with target substring:
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};
var results = target.Select((t, index) => new {str = t,
count = src.Select((c, i) => src.Substring(i)).
Count(sub => sub.StartsWith(t))});
Results is now:
+ [0] { str = "string", count = 4 } <Anonymous Type>
+ [1] { str = "the", count = 4 } <Anonymous Type>
+ [2] { str = "in", count = 6 } <Anonymous Type>
Original code below:
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};
var results = target.Select(t => src.Select((c, i) => src.Substring(i)).
Count(sub => sub.StartsWith(t))).OrderByDescending(t => t);
with grateful acknowledgement to this previous response.
Results from debugger (which need extra logic to include the matching string with its count):
- results {System.Linq.OrderedEnumerable<int,int>}
- Results View Expanding the Results View will enumerate the IEnumerable
[0] 6 int
[1] 4 int
[2] 4 int

- 1
- 1

- 53,498
- 9
- 91
- 140
-
I wonder how it would perform compared to a brute force method (e.g. just loop through the keywords you're seeking, use IndexOf to find occurrences, and count them to a collector array)? In no way do I mean to take away from the awesomeness of this solution, I am just curious since I don't have a good sense for the efficiency of linq. – Jamie Treworgy Oct 27 '10 at 17:13
-
-
-
This is all so beautiful it makes me want to cry. I really need to use LINQ more. – Jamie Treworgy Oct 27 '10 at 17:59
-
I know, I know next to nothing about it so I am using the qs here as a classroom - standing on the shoulders of giants. – Steve Townsend Oct 27 '10 at 18:00
-
Thanks Steve, another quick question - currently it looks for specific keywords ("string", "the", "in"), can we modify it to retrieve ALL keywords from a string? – SharpAffair Oct 27 '10 at 18:05
-
@Sphynx - you mean count the tokens separated by space? You can probably use `Split` and then `GroupBy` per @KeithS, just do not filter on input string list. Take out the `where` clause in his code and try it. – Steve Townsend Oct 27 '10 at 18:08
-
I removed the "where" clause, but his code gives few errors, such as "Invalid expression term" in "by" keyword and "The best overloaded method match for 'string.Split(params char[])' has some invalid arguments" – SharpAffair Oct 27 '10 at 18:15
-
@Sphynx - the `...` in that code calling `Split()` marks it as just a model, intended for you to add more separators if you wish. Just make this into a well-formed list of chars and it should work. Just use " " and remove the rest for now if you like. – Steve Townsend Oct 27 '10 at 18:18
-
@Sphynx. That code won't compile for me at all. Let me sort this out. – Steve Townsend Oct 27 '10 at 18:20
-
-
1@Sphynx - fyi there is a more elegant variation for this at http://stackoverflow.com/questions/4038836/minimize-linq-string-token-counter – Steve Townsend Oct 28 '10 at 01:55
-
I have upvoted your comment as well. Just one final dumb question :) How to enumerate values of "results"? (I want to load them into NameValueCollection) – SharpAffair Oct 28 '10 at 17:41
Dunno about fastest, but Linq is probably the most understandable:
var myListOfKeywords = new [] {"struct", "public", ...};
var keywordCount = from keyword in myProgramText.Split(new []{" ","(", ...})
group by keyword into g
where myListOfKeywords.Contains(g.Key)
select new {g.Key, g.Count()}
foreach(var element in keywordCount)
Console.WriteLine(String.Format("Keyword: {0}, Count: {1}", element.Key, element.Count));
You can write this in a non-Linq-y way, but the basic premise is the same; split the string up into words, and count the occurrences of each word of interest.

- 70,210
- 21
- 112
- 164
Simple algorithm: Split the string into an array of words, iterate over this array, and store the count of each word in a hash table. Sort by count when done.

- 7,908
- 2
- 36
- 33
You could break the string into a collection of strings, one for each word, and then do a LINQ query on the collection. While I doubt it would be the fastest, it would probably be faster than regex.

- 1,964
- 1
- 17
- 26
-
I've implemented single pass string readers before that check for occurances of words/characters as the reading progresses. You see this type of code functions for CSV parsing. – keithwill Oct 27 '10 at 16:52