Linq keyword extraction - limit extraction scope

Question

Is there a way to limit the number of keywords taken into consideration? For example, I'd like only the first 1000 words of the text to be counted. There's a "Take" method in LINQ, but it serves a different purpose: all words will be counted, and N records will be returned. What's the right way to do this?

`Take()` is a lazy function! It doesn't cause all the words to be calculated. See http://ideone.com/WwDwg for an example. – Vlad, Nov 07 '10 at 12:25

score 2 · Accepted Answer · answered Nov 07 '10 at 12:21

Simply apply Take earlier - straight after the call to Split:

var results = src.Split()
                 .Take(1000)
                 .GroupBy(...) // etc
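As a fuller sketch of the same idea, the snippet below counts word frequencies among only the first N words. The names `KeywordCounter` and `CountFirstWords` are hypothetical, and the grouping step is an assumption based on the GroupBy/Count query quoted elsewhere in this thread, not necessarily the asker's exact code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordCounter
{
    // Hypothetical helper: counts occurrences among the first maxWords words only.
    public static Dictionary<string, int> CountFirstWords(string src, int maxWords)
    {
        // A null separator array means "split on whitespace".
        // Split is eager: it still builds an array of every word in src.
        return src.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                  .Take(maxWords)          // later stages see at most maxWords words
                  .GroupBy(word => word)   // group identical words together
                  .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Because Take sits directly after Split, the grouping and counting never see more than maxWords words, even though Split itself has already scanned the whole string.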

Jon Skeet

Simple solution, but seems to work well in my case. Thanks Jon! – SharpAffair Nov 08 '10 at 12:37

Ani · Answer 2 · 2010-11-07T12:46:56.503

Enumerable.Take does in fact stream results out; it doesn't buffer up its source entirely and then return only the first N. Looking at your original solution, though, the problem is that the sequence you would want to apply Take to comes from String.Split. Unfortunately, this method doesn't use any sort of deferred execution; it eagerly creates an array of all the 'splits' and then returns it.

Consequently, the technique to get a streaming sequence of words from some text would be something like:

var words = src.StreamingSplit()  // you'll have to implement that
               .Take(1000);
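One possible way to write such a method is an iterator block using yield return. This is only a minimal sketch: `StreamingSplit` is not a framework method, the class name `StringStreamingExtensions` is made up, and splitting on whitespace while skipping empty entries is an assumption about what counts as a word here (a plain src.Split() would keep empty entries).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StringStreamingExtensions
{
    // Hypothetical extension method: yields whitespace-separated words one at a time,
    // scanning only as far into src as the consumer actually asks for.
    public static IEnumerable<string> StreamingSplit(this string src)
    {
        if (src == null) throw new ArgumentNullException(nameof(src));
        return SplitIterator(src);
    }

    private static IEnumerable<string> SplitIterator(string src)
    {
        int start = -1; // index where the current word began, or -1 when between words
        for (int i = 0; i < src.Length; i++)
        {
            if (char.IsWhiteSpace(src[i]))
            {
                if (start >= 0)
                {
                    yield return src.Substring(start, i - start);
                    start = -1;
                }
            }
            else if (start < 0)
            {
                start = i;
            }
        }

        // Emit the trailing word, if the text does not end in whitespace.
        if (start >= 0) yield return src.Substring(start);
    }
}
```

With this in place, src.StreamingSplit().Take(1000) stops scanning the text once the 1,000th word has been yielded, since Take stops requesting items at that point.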

However, I do note that the rest of your query is:

...
.GroupBy(str => str)   // group words by the value
.Select(g => new
             {
                str = g.Key,      // the value
                count = g.Count() // the count of that value
              });

Do note that GroupBy is a buffering operation - you can expect that all of the 1,000 words from its source will end up getting stored somewhere in the process of the groups being piped out.

As I see it, the options are:

1. If you don't mind going through all of the text for splitting purposes, then src.Split().Take(1000) is fine. The downside is wasted time (to continue splitting after it is no longer necessary) and wasted space (to store all of the words in an array even though only the first 1,000 will be needed). However, the rest of the query will not operate on any more words than necessary.
2. If you can't afford to do (1) because of time / memory constraints, go with src.StreamingSplit().Take(1000) or equivalent. In this case, none of the original text will be processed after 1,000 words have been found.

Do note that those 1,000 words themselves will end up getting buffered by the GroupBy clause in both cases.

score 1 · Answer 3 · answered Nov 07 '10 at 12:23

Well, strictly speaking LINQ is not necessarily going to read everything; Take will stop as soon as it can. The problem is that in the related question you look at Count, and it is hard to get a Count without consuming all the data. Likewise, string.Split will look at everything.

But if you wrote a lazy non-buffering Split function (using yield return) and you wanted the first 1000 unique words, then

var words = LazySplit(text).Distinct().Take(1000);

would work.
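To see that a Distinct().Take(n) chain really does stop early, one can count how many items are pulled from the source. The sketch below is illustrative only; `LazinessDemo`, `CountingPulls` and `ItemsConsumed` are hypothetical names, and any lazy word source (such as a yield-based split) could be passed in as `words`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazinessDemo
{
    // Hypothetical helper: passes items through unchanged and invokes onPull
    // each time the consumer actually requests one.
    public static IEnumerable<T> CountingPulls<T>(IEnumerable<T> source, Action onPull)
    {
        foreach (var item in source)
        {
            onPull();
            yield return item;
        }
    }

    // Returns how many source items were consumed in order to obtain
    // the first `wanted` distinct words.
    public static int ItemsConsumed(IEnumerable<string> words, int wanted)
    {
        int pulled = 0;
        CountingPulls(words, () => pulled++)
            .Distinct()     // streams: remembers the values seen so far, not the whole source
            .Take(wanted)   // stops asking for more once `wanted` items have come through
            .ToList();      // forces the lazy chain to run
        return pulled;
    }
}
```

If the returned count is smaller than the total number of words, the rest of the source was never touched.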
