18

I have a set, sentences, which contains sentences from the English language as strings. I wish to create a subset of sentences, sentences2, which contains only sentences built from a set of 20 unique words. Of course there are many, many such subsets, but I'm looking for the "best" one, where by "best" I mean the subset in which all words have the highest possible representation in sentences2.

The following example will further clarify what I mean by "best":

If I was to filter sentences for this set of words:

(i,you,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,know,here,feel,are)

I would get the following:

sentences2 = set(("where are you now", "here i am", "can you do it", "yes i can", "but can i do it", "no you cant", "do you feel good", "yes i do", "why are you here", "i dont know", "i think i know why", "you dont think", "yes i do", "no you dont", "i dont think you think", "i feel good", "but i am good", "i cant do it now", "yes you can", "but i cant", "where do you think i am"))

and here each word is represented at least twice, as we can see if we use a counter on sentences2:

c = collections.Counter({'i': 13, 'you': 10, 'do': 6, 'think': 5, 'dont': 4, 'can': 4, 'good': 3, 'but': 3, 'am': 3, 'it': 3, 'cant': 3, 'yes': 3, 'know': 2, 'no': 2, 'here': 2, 'why': 2, 'feel': 2, 'are': 2, 'now': 2, 'where': 2})

If each word is represented at least twice we can say that this set of 20 words has a score of 2.

score = min(c.values())

However, the following set:

(i,you,he,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,here,she,are)

has a score of 5, since if I use it to filter sentences, I get a sentences2 where each word is represented at least five times.

So I'm after the highest possible score for all possible 20 word combinations.

Here is my attempt at solving this problem:

import itertools
from collections import Counter

sentences = ... # all the sentences in my text, as strings
common_words = ... # the hundred most common words in the text
result_size = 20

highest_score = 0
for sample in itertools.combinations(common_words, result_size):

    # keep only sentences whose words all come from the candidate word set
    sentences2 = list(filter(lambda s: set(s.split()).issubset(sample), sentences))
    c = Counter(word for s in sentences2 for word in s.split())

    if c and min(c.values()) > highest_score:
        # this is the set with the highest score to date
        print(c)
        highest_score = min(c.values())

However, this algorithm will take forever to compute, with C(100, 20) ≈ 5.36 × 10^20 combinations if I'm not mistaken. Can you suggest how I might go about solving this with a much faster algorithm?

Please note that the resulting set can contain fewer than 20 words, and that this is completely fine. For example, the number of entries in c in my algorithm does not have to match result_size.

Also note that I'm expecting the words in the resulting set to be found in the top one hundred words (common_words contains 100 values). This is also by design.
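For reference, here is a small helper that makes the scoring rule explicit: filter to sentences built only from the candidate words, count each word at most once per sentence, and take the minimum of the counts (just a sketch; the name score_of is illustrative, not part of my actual code):

from collections import Counter

def score_of(word_set, sentences):
    # keep only sentences whose (unique) words all come from word_set
    kept = [s for s in sentences if set(s.split()) <= set(word_set)]
    # count each word at most once per sentence; words that never appear get no entry
    c = Counter(w for s in kept for w in set(s.split()))
    return min(c.values()) if c else 0

Applied to the 20 words and the sentences2 from the example above, this returns 2.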

Baz
  • You can make it more clear by adding a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) to your question. – Mazdak Aug 25 '15 at 11:33
  • So, you want the 20 words where the minimum number of times each of those words appears in the sentences is the highest? It would probably be a good idea to create a dictionary mapping words to their minimum occurrence in any of the sentences. – tobias_k Aug 25 '15 at 11:43
  • @tobias_k Updated my question to make it clearer! – Baz Aug 25 '15 at 11:50
  • @Baz: Can a word appear several times in one sentence? How does that count? Do I understand correctly that if you were to select a single word for your set, you would not select the word occurring the most in total but rather the one appearing in the most sentences? What condition does a sentence have to fulfil in order to qualify for the set sentences2? Better than examples is to give an *exact* specification... I find your first paragraph very hard to understand. Can you clarify further please? – dingalapadum Aug 25 '15 at 12:16
  • Please *specify* what you want. Currently, your question is unclear... – Serge Ballesta Aug 25 '15 at 12:18
  • It might help if you could upload the inputs somewhere. – David Eisenstat Aug 25 '15 at 12:41
  • @dingalapadum Words can occur more than once in a sentence but I actually only want them to count once in such a case. However, this isn't very important and my algorithm doesn't do this correctly at the moment. I've again updated my question with the hope that it now makes sense. – Baz Aug 25 '15 at 13:08
  • To me the two sets of words you're using look the same. What's different? – Joel Aug 27 '15 at 12:00
  • Fixed that @Joel, thanks! – Baz Aug 27 '15 at 12:04
  • I still can't understand what's happening. 'but' appears in 3 sentences. So I don't see how your score for the second set can be 5. – Joel Aug 27 '15 at 12:08
  • You're willing to set a 500 point bounty but not upload an input that you actually care about? – David Eisenstat Aug 27 '15 at 12:16
  • @David Eisenstat My input is a corpus containing millions of words. You can take any piece of English text containing a large number of sentences. I'm using a special library to extract sentences from the corpus and I don't want to add that to my question. – Baz Aug 27 '15 at 18:19
  • This problem definitely feels NP-hard. – asmeurer Aug 27 '15 at 19:08
  • To clarify: does "which only contains sentences containing 20 unique words." mean "contains sentences containing only 20 unique words"? – Joel Aug 28 '15 at 04:06
  • @Joel Yes, that's correct! – Baz Aug 28 '15 at 08:23
  • Is it essential that solutions/suggestions offered are in Python, or would any (sufficiently readable) pseudo-code be ok for you? – Bert te Velde Aug 28 '15 at 16:38
  • @Bert te Velde pseudo-code is fine! – Baz Aug 28 '15 at 17:55
  • I'm a bit puzzled by your remark that the resulting set may contain fewer than 20 words. Please help me understand the specs correctly: what prevents me from specifying the "special words" to have only 1 item, which of course would then be the most common word (e.g. "a" or "the")? Or is it that the filter must always contain precisely 20 items, but that I may choose my sentences2 such that certain words simply don't occur in them, and that this excludes them from the "count" criterion, i.e. you only consider counts>0 for the max-min test? I'm sorry if I misread your specs but want to be sure. – Bert te Velde Aug 28 '15 at 20:19
  • Sorry, forget about the 1-word idea: sentences must contain ONLY words from the filter, don't they? My other clarification question remains: count=0 for some word in the filter doesn't make the "value" of the set 0: for the max-min test we consider only words that occur at least once in the sentences2 set, correct? – Bert te Velde Aug 28 '15 at 20:55
  • @Bert te Velde I test 20 words for each iteration of my for loop. This loop begins by filtering all the sentences so that I am left with only those sentences built from words in this set of 20. It is therefore possible that one of the words then ends up with a count of zero. For example, the word "the" is dependent on nouns, and if I take a set of 20 words including "the" and 19 non-nouns, then I will have no sentences containing the word "the". But if I get a top score for the remaining 19 words, then I accept this set and forget about "the". – Baz Aug 28 '15 at 20:57
  • @Bert te Velde Yes, that is correct! – Baz Aug 28 '15 at 21:00
  • I'd like to try to attack this, but I'd want to have the input! – Antti Haapala -- Слава Україні Aug 29 '15 at 09:41
  • @Baz It's a really interesting challenge. I've contributed an "answer" that is only a description of an approach (feel free to ask if you need more clarification). Would you mind providing the 100 "common_words" (assuming these are pre-set)? I'd like to organize some real tests for my own (C#) implementation one of these days. Or are the common_words to be detected from the actual input? (This part of the spects isn't clear to me) – Bert te Velde Sep 01 '15 at 07:12
  • @Antti Haapala There is no specific input. Rather, you can take a large piece of text for a particular language and split it into sentences. – Baz Sep 01 '15 at 11:54
  • It feels like an NP-hard problem. If that is the case, you cannot guarantee you have found the best solution. However, you could get close to it using a genetic algorithm to find a local maximum. – Jonathan Nappee Sep 02 '15 at 16:33

6 Answers

7

Disclaimer: You have not specified data characteristics, so my answer will assume that it is not too large (no more than 1,000,000 sentences, each with at most 1,000 words). Also, the description is a bit complicated and I might not have understood the problem fully.

Solution:
Instead of focusing on different combinations, why don't you create a hashMap (dict in Python) for your 100 most frequently used words, then traverse the array of sentences and, for each word in each sentence, increase its corresponding value (if it is already inside the dict).
In the end, just sort this hashMap according to the number of occurrences (value) of each word (key), then use the most frequent 20.
Complexity:
A quick look at the algorithm gives:
Traverse the N sentences, each with at most M words, incrementing the hashMap values; at the end, sort the array of (word, occurrences) pairs, which is negligible (the hashMap size is constant: 100 frequent words), and extract the first 20.
Time complexity: O(N*M)
Space complexity: O(1) (we don't need to store the sentences, we just keep the hashMap)

Sample Code:
Here is some quick pseudo-code:

word_occur_dict = {#initialized with frequent words as keys, and zero as value for all}
for sentence in sentences:    #for each sentence
    sentence_words = sentence.split(" ")    #construct the word list
    for word in sentence_words:        #for each word
        if word in word_occur_dict:         #if it is a frequent word, increase value
            word_occur_dict[word]++
final_result = sort_dict(word_occur_dict)[:20] #returns list of tuples

Python Code:

import operator

common_words = ["do","think","yes","dont","can","it","good","cant","but","am","why","where","now","no","know","here","feel","are","i","you","he","she"]
common_words_dict = {}
sentences = ["where are you now", "here i am", "can you do it", "yes i can", "but can i do it", "no you cant", "do you feel good", "yes i do", "why are you here", "i dont know", "i think i know why", "you dont think", "yes i do", "no you dont", "i dont think you think", "i feel good", "but i am good", "i cant do it now", "yes you can", "but i cant", "where do you think i am"]

for w in common_words:    #initialize the dict
    common_words_dict[w] = 0    

for sentence in sentences:    #for each sentence
    sentence_words = sentence.split(" ")    #construct the word list
    for word in sentence_words:        #for each word
        if word in common_words_dict:         #if it is a frequent word, increase value
            common_words_dict[word] = common_words_dict[word]+1

sorted_word_dict = sorted(common_words_dict.items(), key=operator.itemgetter(1), reverse=True)

print(sorted_word_dict[:20])

By the way, 'he' and 'she' do not appear anywhere in the sentences, but you said the following word combination has a score of 5:

(i,you,he,do,think,yes,dont,can,it,good,cant,but,am,why,where,now,no,here,she,are)

Have I misunderstood the problem?

Credit where it is due: StackOverflow: Sort a Python dictionary by value

Kamyar Ghasemlou
  • Isn't this only the initial (and rather simple) part, i.e. determining which words occur most frequently in all input? It seems to me that selecting from that sorted list the top-20 is not necessarily the solution for the problem as stated. – Bert te Velde Sep 09 '15 at 13:17
5

STEP 1 should be to create a data structure that has only the words in sentences that appear in common_words. The structure can also have the number of times the word appears and a set of integers referencing which sentences the word is in.

counts = [..., {
  word: string,
  count: number,
  ids: Set<number>
}, ...]

Some pseudo code

countsMap = Map()
For i = 0 To sentences.Size - 1
  sentence = sentences[i]
  For Each word in sentence
    If Not countsMap.Contains(word) Then
      countsMap.Add(word, {word:word, count:0, ids:Set()})
    End If
    value = countsMap.Get(word)
    If Not value.ids.Contains(i) Then
      value.Count++
      value.ids.Add(i)
      countsMap[word] = value
    End If
  Next
Next
counts = countsMap.Values
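A rough Python sketch of the same STEP 1 idea (the toy sentences/common_words values and the word_ids name are only illustrative):

from collections import defaultdict

# Toy stand-ins for the question's `sentences` and `common_words`.
sentences = ["where are you now", "here i am", "can you do it", "yes i can"]
common_words = {"i", "you", "do", "can", "it", "yes", "am", "here", "where", "are", "now"}

word_ids = defaultdict(set)             # word -> set of sentence indices it appears in
for i, sentence in enumerate(sentences):
    for word in set(sentence.split()):  # a duplicate word counts once per sentence
        if word in common_words:
            word_ids[word].add(i)

# len(word_ids[w]) plays the role of `count` in the pseudo-code above.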

Idealistic STEP 2 If you're lucky and your counts data type contains < 40 entries, you can do an exhaustive search of C(n, 20) combinations in a reasonable amount of time on a single computer (C(38, 20) ≈ 33 billion). This would involve iterating over the combinations and intersecting the ids sets together; the final set size is your minimum score.

Some pseudo code

bestScore = 0
bestCombo = null
For Each combo in Combinations(counts, 20)
  score = combo.Reduce((prev, curr) => prev.ids.Intersect(curr.ids)).Size
  If bestScore < score Then
    bestScore = score
    bestCombo = combo
  End If
Next

Realistic STEP 2 In most cases your counts will contain many more than 40 unique words, in which case you'll have to settle for a best guess / approximation. I would probably do something like this: use the above code, but instead of Pick 20 use Pick 2, sort your results descending by score, and take 10.

Some pseudo code

list = []
For Each combo in Combinations(counts, 2)
  score = combo[0].ids.Intersect(combo[1].ids).Size
  list.Add( { score:score, words:[ combo[0].word, combo[1].word ] } )
Next
// sort descending by score
list.Sort((a, b) => b.score - a.score)
// grab the 20 best words
result = Set()
i = 0
While result.Size < 20
  result.Add(list[i].words[0])
  result.Add(list[i].words[1])
  i = i + 1
End While
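A rough Python version of this Pick-2 heuristic, reusing the word_ids mapping from the STEP 1 sketch above (the names are illustrative, and like the pseudo-code it may overshoot 20 by one word):

from itertools import combinations

# Score every pair of words by how many sentences contain both,
# then greedily collect words from the best-scoring pairs.
pair_scores = sorted(
    ((len(word_ids[a] & word_ids[b]), a, b) for a, b in combinations(word_ids, 2)),
    reverse=True,
)

result = set()
for score, a, b in pair_scores:
    if len(result) >= 20:
        break
    result.update((a, b))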

Will you get a final score greater than 1? Statistically that would depend on how many unique words and sentences there are, but probably not.

EDIT An implementation note and correction. Intersecting the sets of sentence ids that the words appear in will give you the minimum score minus 1 (zero indexed). For example, "Dog" is in sentences 1 and 2; "Cat" is in sentences 2 and 3; "Frog" is in sentence 4. The intersection [1,2] ∩ [2,3] ∩ [4] = [], but the minimum score is 1, i.e. result.Size() + 1. In the same way, with just "Dog" and "Cat", [1,2] ∩ [2,3] = [2] has a set size of 1 but the minimum score is 2.

Louis Ricci
  • @BurhanKhalid - you're right, it's not any specific programming language, it's pseudo code - a notation resembling a simplified programming language, used in program design - a higher level of abstraction that represents the steps in an algorithm. – Louis Ricci Aug 27 '15 at 18:13
  • Thanks @Louis Ricci. Can you explain the line: score = combo.Reduce((prev, curr) => prev.ids.Intersect(curr.ids)).Size. What is the type of combo and what are prev and curr? – Baz Aug 27 '15 at 18:24
  • @Baz - combo is a collection of the elements of the count type that I defined above. Reduce is common map/reduce terminology for taking a collection and returning a single value. So all the line is doing is looping through all of the elements of combo (the elements contain a word, count and ids property) and intersecting all of the ids sets together; this returns the set of unique indices into *sentences* that contain all of the words, the size of which is your score. – Louis Ricci Aug 27 '15 at 18:47
4

May I suggest you try the approach below. I have no mathematical proof that it produces the absolutely best "score", or indeed some other "measure of goodness", but I do believe it may help you out, unless, of course, the case requires that you actually prove and find the absolutely best.

I don't speak python, but the strategy and ensuing implementation are simple enough to not even require pseudo-code to explain. I will use quite a few words though, not because it is very complicated, but to be clear (I hope).

You will need a method to diagonalize a matrix, or at least to find the eigenvector corresponding to the largest eigenvalue. Not every language natively provides a method for that. In C++ you might use the Eigen library. In C#, which I used to test, I hooked up MathNet. In Python, I've no idea. The requirements for the diagonalization/eigenvector facility are limited: the matrix is always real, symmetric, and all elements are positive. However, although the diagonal elements are at least as big as any other in their row/column, the matrix is certainly not "diagonally dominant".

The problem you present can in abstracto be formulated for integers, not words, which makes it more convenient (for me...) to formulate and implement (but otherwise, it is quite irrelevant). In fact, we only need your "common_words", hence 100 integers [0..99], and you set up once the map between words and integers simply by putting the words in an array or list and using their index.

As a side note: the strategy is likely (much) more adequate for your language application than for a completely generalized problem with integers where you might construct "maliciously" difficult input. The reason is that we'll essentially exploit pair-wise correlations, if there are any, between the items (words within the same sentence), whereas dominantly strong correlations for triples, quadruples, ... might make it less efficient (but we'll do something for the triples). Obviously, in well-formed language you'd expect correlations: "am" is often to show up together with "I", and whereas "have" may in total be more frequent than "has", as soon as you find "he" you might be more likely to also get a "has" than "have" in the same sentence. And so on.

A few definitions are in order.

NCommon is the number of elements in your common_words (in this case 100). The common_words "are" the integers [0..NCommon-1].

NSetMax is your problem limit (in this case 20): the maximum number of 'words' you are willing to use in the end.

F is a filter: the collection of 'words' that you're using to define which sentences are included in some collection. In the end you must have adapted your filter such that F.Count does not exceed NSetMax.

M(N) is a square NxN matrix; row and column indexes are in [0..N-1]

S(F) is the collection of sentences (from the problem input) that satisfy a filter F. A "sentence" is here always a collection of integers in the range [0..NCommon-1], see above. Duplicate words in the same sentence are irrelevant (problem description) and such duplications are removed from our integer-sentences.

Here we go.

I. Preparation.

Initialize a matrix MCommon = M(NCommon). Create the FCommon filter that contains all the common_words.

Filter the input, removing duplicate words in sentences. This produces a sentences set S0 = S(FCommon), which is your "real" input: everything that's irrelevant has been removed.

On the fly, while accepting sentences into S0, fill the matrix MCommon: {for (j in sentence): for (k in sentence): M(j,k) += 1}. The matrix is symmetric, so you may just fill the upper-right triangle and mirror it at the end into the lower-left triangle.

Having completed the scan, M[j,k] is the number of k-occurrences in 'sentences' that contain j (and vice-versa: the matrix is symmetric). M[k,k] is the total count of k in all sentences in S together. M is the (pair-wise) correlation matrix, telling you how likely it is that a {j,k} combination occurs in the underlying sentences set S.

(Always, when dealing with such matrices: if there happen to be, at the end, empty columns (and hence rows), remove these columns and rows, simultaneously removing the corresponding entry in the filter that underlies the matrix, as it obviously doesn't play a role. Henceforth we'll assume this has been done (unless stated otherwise), i.e. no column/row in the matrix is empty.)
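A minimal NumPy sketch of this preparation step, assuming the input has already been reduced to int_sentences: lists of integer indices into common_words, with out-of-vocabulary words and duplicates removed (the variable names and toy data are only illustrative):

import numpy as np

NCommon = 100
int_sentences = [[0, 3, 7], [1, 3], [0, 1, 7, 9]]   # toy stand-in for the filtered input S0

MCommon = np.zeros((NCommon, NCommon))
for sentence in int_sentences:
    for j in sentence:
        for k in sentence:
            MCommon[j, k] += 1   # MCommon[j, j] ends up as the number of sentences containing j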

II. Compute result (main approach, we'll refine below):

Compute the eigenvector E of MCommon that corresponds to the highest eigenvalue: E is an array (vector) with NCommon coefficients.

Set NTarget=NSetMax

Determine which are the NTarget biggest coefficients in E. We're not interested in their values, but in their indexes. Put these indexes together: they define a new filter F(NTarget).

Run S0 through the new filter to produce STarget. Count all 'word'-occurrences, find their minimum: that's your "set-quality" value. You may do so, for instance, by also computing the associated MTarget and scanning the diagonal values MTarget[k,k]. That may seem to involve unnecessary additional effort, as you're computing also the off-diagonal elements, but we'll see that MTarget may be handy in the follow-up refinements.
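A minimal sketch of this main step, reusing MCommon and int_sentences from the preparation sketch above; numpy.linalg.eigh handles the real symmetric case, and taking absolute values guards against the overall sign ambiguity of the eigenvector. Words that end up with zero occurrences are ignored when scoring, as discussed in the comments on the question:

NTarget = 20   # NSetMax
eigvals, eigvecs = np.linalg.eigh(MCommon)         # eigenvalues in ascending order
E = np.abs(eigvecs[:, -1])                         # eigenvector of the largest eigenvalue
F_target = set(np.argsort(E)[-NTarget:].tolist())  # indices of the NTarget biggest coefficients

S_target = [s for s in int_sentences if set(s) <= F_target]
counts = {w: sum(w in s for s in S_target) for w in F_target}
score = min((c for c in counts.values() if c > 0), default=0)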

III. Refinements.

A) We need to verify whether by removing one or a few more items from the filter F(NSetMax), reducing it to some NTarget smaller than NSetMax, we would get a better score value. This requires some care: it is very well possible that removing TWO (or 3, ...) items would improve the score, but that removing either of them alone would deteriorate it.

Let's take a first (and fairly reasonable) shot at this issue.

While producing STarget you also fill a new matrix MTarget (you see, it is handy...), as you did with MCommon before. The size is NTarget. Get its largest-eigenvalue eigenvector. Determine the SMALLEST coefficient. Remove from the filter the corresponding index.

Filter again (as input you can, of course, now use the STarget collection) and compute the score. If it's better: mark/remember. In all cases (improvement or not) continue in the same fashion, reducing the filter one-by-one until you've nothing left.

B) One may expect that the reason, as briefly explained, for the "cautious" approach in further reducing the filter below NSetMax - one at a time - might also apply to some extent to the first step, where we reduce F(NCommon) to F(NTarget=NSetMax) in one big step.

To accommodate this, we may want to go from NCommon down to NSetMax in steps. In order not to incur too much computational overhead, we won't take steps of size 1, but rather each time a reduction by, for instance, some 10%, or something similar.

So, in II above, do not set NTarget immediately to NSetMax, but (for instance) set NTarget = 90. Construct the corresponding filter, apply the filter to S0, giving S1, produce also the matrix M1, get its eigenvector (largest eigenvalue). Repeat: set NTarget=80, 70, 60 and so on. In later stages you may get even more cautious, lowering 40 to 35 to 30 to 28 to 26 ... In each step you build upon and use the results from the previous, to optimally exploit the reduction in size and computational effort.

All the time you want to monitor whether the largest NSetMax (final value in this part of the algorithm) coefficients always occur with the same indexes or not. This will give some information regarding whether the final result is reliable: the stability, or not, of the expected final filter against the algorithmic path towards it.

C) Consider the situation where we have reduced to NSetMax and investigate whether or not it should be reduced further in a one-at-a-time approach. (The same may apply in the final stages of (B), if, when approaching NSetMax from above, you go "one-at-a-time" as soon as you get close to NSetMax.)

Suppose there are during this phase of the algorithm, in your (first or a later) STarget collection, certain PAIRS of 'words', such that removing such specific pair from the filter would improve things, while neither of their individual coefficients in the eigenvector is the smallest. I'm not sure whether or not this is likely or even possible, but let's see how we might handle it. If (your) tests show that it's irrelevant, you may in a final implementation remove this feature from the algorithm.

Assume we have some NTarget, an associated filter FTarget and matrix MTarget. (The order of the items ('words') in the filter always equals (of course) the order of the rows and columns in the matrix.)

From MTarget we can directly deduce some information about what would happen if we were to remove the j-th item from the filter. Of course, the j-th row and j-th column in the matrix become empty (all zeros). More interestingly, M[j,k] tells how many times item k occurs in all sentences that contain item j. So, when eliminating j (by removing it from the filter), we know in advance that in the resulting new matrix MReduced, element MReduced[k,k] will diminish precisely by that M[j,k] value. (By the way, this provides an alternative way of determining which item to remove (if you opt to remove only one): the minimum{j: M[j,j]} is the score of the associated set S, and from M we can compute how ALL diagonal elements would change upon removing a particular item j, enabling a precomputation of the resulting score.)

Here, however, we would like to know how the score would be affected upon removal of some PAIR of items, j and k. We now need to determine how M[p,p] is affected for all p that are not j or k. This you cannot directly compute from M: removing j affects row k, and whereas we know how it changes [k,k], and any other [p,p], we don't know how it would change [k,p], which is needed to compute how ALSO ('subsequently') removing k will change [p,p]. Note, by the way, that it must be immaterial for the final outcome whether you remove first j and then k or vice versa, or both simultaneously.

For this we need some kind of 'derivative'. Fortunately we can compute and process that without too much computational effort (given that NTarget is already rather small, 20 or so).

Consider a 'reduction filter' G(F; j) related to the current filter F, defined simply as processing F but ignoring item j. With G we compute, in the same manner as always, a 'reduction matrix' N(G), where for convenience in the discussion (and implementation) we keep the same size (not removing the empty column/row). Of course, in such an N matrix, the j-th row and j-th column are empty. Its diagonal elements will have values as explained above (we could have computed these from M directly). However, now we also have all the off-diagonal elements that result from removing j.

From N(G{F;j}) we can now determine what would happen if we remove ALSO item k, see elaboration above about how to get the expected score from a current matrix. So, this allows computing the score upon removal of the pair {j,k}.

If the setsize is 20 we have to compute 20 N(G(F;j)) matrices, a relatively small effort, I assume (your sentences-collection will also by now have become a lot smaller than originally). Then, having all N, compute for each of the 20*19/2 unique pairs the resulting pair-removal-scores and you're in the position to choose which PAIR (rather than individual element) to remove from a filter. You may compare that, on the fly, against reduction by 'one-at-a-time' and make appropriate choices how to systematically reduce the filter in search for a better score. There are many ways to go about this comparison. A relatively simple one would be: calculate a reduction from first a pair and then a single one (in total 3 elements). Compare against first removing a single one and then a pair. Pick the best of these two, but perform from that choice only the first step (either a single or a pair) and repeat the procedure.

Note that using these "derived" filters G, matrices N and the ensuing score-precalculations as explained, you effectively bring in correlations between TRIPLES of items: you determine what happens for all {j,k,p} to the frequency of p upon removing both j and k. This idea can, of course, be extended to incorporate quadruples and beyond, but (a) I don't believe it will practically help very much, and (b) the further you go on this road, the faster the computational effort will increase. I even doubt whether bringing in the "triples" as explained here is relevant, but I'm not a language expert and apart from a little additional programming effort there's no major drawback.

D) The backbone of the strategy is relying on the eigenvector with the largest eigenvalue to point to the relevant items to subsequently filter with. Of course, it may happen that two or more eigenvalues are "almost" the same, and the corresponding eigenvectors may point to completely different sets of items when analyzing their largest components. In that case it will be worthwhile to "branch", that is: go with one of them, work to the end, and afterwards redo everything with their competitor(s) and see if they produce a better outcome. A problem arises here if you encounter a lot of branches (at various points on the road to a solution). This is not likely to happen, but the implementation should have some strategy to deal with it in a practical way if it does happen. I suggest that you first leave any branching out, but that you do monitor the occurrence of "competitive" largest eigenvalues so as to output a warning to the user. Alternatively you may implement a handling of such branching, but be very strict on what the program will consider "almost equal", or put in some limit as to the number of branches that are (in total) to be investigated, thereby reducing the chance that the computation time gets out of hand.

I have to leave it here (for lack of time...), hope I've explained sufficiently the idea, as well as some details and refinements to consider.

I've not had the time to organize for myself relevant 'real language' sentences as input and test against that. I've programmed the basic steps in C#, tested against random integer-'sentences', with some biases to force certain 'words' to occur more often than others, but without bothering about correlations among 'words' within 'sentences'. The results appeared quite reasonable to me, but I didn't have the time to analyze them extensively.

Given the lack of structure in the 'random' sequences that I've used as input, whereas real language is expected to exhibit significant pair-wise correlations, which the strategy exploits, I have good hope that it might be a useful thing for you to have a look at.

[EDITED] Note added: in the above I've occasionally been sloppy in talking about "j", "k" and so on. I'm sorry about that. Sometimes "j" refers to the j-th entry of something (a matrix row, an index in a filter-list), sometimes it refers to the corresponding VALUE in (usually) a filter. For instance, a filter may contain 18 items, numbered (indexes) 0..17, but their VALUES always refer to the original common_words list, so they may be {3, 15, 29, 30, 31, 40, ...}. Then, when it says "remove j from the filter", it will usually mean "remove the j-th entry from the filter" (and that entry may have any value from [0..NCommon-1]). When APPLYING a filter to a sentence, you compare the VALUES in the filter against the values in the sentence. I hope that the context - in combination with a fair understanding of the line of reasoning - always makes it clear what is really meant.

[EDITED: some test results] I've run my C# implementation, using the above algorithm (more or less: most but not all of the refinements/details described), to get some impression of what it would produce.

For input I took 4 books (plain text) from the Gutenberg project. In total (only) 27k sentences, 380k words (20k different), so a rather small sample.

The list of common words as determined from that input started with "the", "of", "and", "to"... (when sorted to frequency of occurrence in the total input).

The algorithm filtered 14 words ("i", "what", "do", "you", "it", "yes", ...) to produce an "optimal" "set-quality-value" of 8 (139 sentences were found with only those 14 words).

As I was suspicious about the assumption that just 100 "common words" were to be used, I had a priori allowed 500 common words, rather than 100, and sure enough among the words that made it to the final filter, 4 were frequency-numbered above 100 ("yes", "know", "say", "think"): "yes", for instance, was #224 in the overall list if you were to sort them by occurrence in all input, presumably the base for "common words".

Bert te Velde
3

I'm not sure if it's possible to come up with the best solution in less than exponential time; the problem may not have enough structure. But here is a heuristic for coming up with a 'good' solution.

I think the way to do this is to start with a wordset of size 0 and add words to it one by one in a 'clever' way, with a max of 20. Consider that for a given wordset_n, the count for each individual word can only increase or stay the same when a new word is added. The only way that wordset_(n+1) can have a lower score than wordset_n is if the (n+1)th word brings down the minimum. So we restrict ourselves to only adding words which bring the minimum up or keep it the same (but see an elaboration below).

So to a first approximation,

  1. Sort the words in common_words by frequency in sentences.
  2. Add the most-frequently occurring words to wordset until the score is nonzero.
  3. Search the tree of possible wordsets constructed by only adding words which increase or maintain the score of the previous node (max N = 20). So a path down this tree will be wordset_n, wordset_(n+1), wordset_(n+2), with each node consisting of the previous node plus one new word.
  4. Choose the leaf node with the highest score.

On a second approximation, a) you may want to try multiple combinations for step 2. 100 choose 3 is 161,700, which is not out of the question. b) Additionally, for step 3, you may want to look ahead two steps, not just one--i.e., only reject a word for spot n+1 in wordset if, after all possibilities for word n+2, the score is still lower than wordset_n's. This might help if there are anti-correlations, i.e. a short sentence with one verb is unlikely to contain another. Lastly, c) if this method is still computationally prohibitive you can constrain tree construction still further by closing off branches where the (n+1)th word doesn't raise the score by a given amount.

If the distributions of words in sentences are relatively independent of one another, or only exhibit positive correlations, this method may get you something pretty close to optimal.
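A purely greedy, no-branching Python simplification of this idea (an illustrative sketch only: it omits the seeding of step 2 and the look-ahead of the second approximation, and assumes sentences is a list of strings and common_words an iterable of words, as in the question):

from collections import Counter

def score(wordset, sentences):
    # minimum count over the words occurring in the sentences built only from wordset
    kept = [s for s in sentences if set(s.split()) <= wordset]
    c = Counter(w for s in kept for w in set(s.split()))
    return min(c.values()) if c else 0

def grow_wordset(common_words, sentences, max_size=20):
    # greedily add the word that maximizes the score (ties broken arbitrarily)
    wordset = set()
    while len(wordset) < max_size:
        best_word, best_score = None, -1
        for w in common_words:
            if w in wordset:
                continue
            s = score(wordset | {w}, sentences)
            if s > best_score:
                best_word, best_score = w, s
        wordset.add(best_word)
    return wordset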

Matt Phillips
1

I would look for a form of data clustering to speed up the search across each 20-word trial.

The data that matters is the set of unique words in a sentence.

Two sentences can be considered close if the Jaccard distance between them is small. The Jaccard distance is 1 - (size of the intersection of the two sentences' word sets) / (size of the union of the two sentences' word sets).
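A small Python helper for this distance, treating each sentence as its set of unique words (the function name is only illustrative):

def jaccard_distance(sentence_a, sentence_b):
    a, b = set(sentence_a.split()), set(sentence_b.split())
    union = a | b
    return 1.0 - len(a & b) / float(len(union)) if union else 0.0

print(jaccard_distance("where are you now", "where do you think i am"))  # 1 - 2/8 = 0.75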

Because the Jaccard distance is a metric (it satisfies the triangle inequality), an M-tree index can be built, which allows faster finding of sentences.

From this basis, clustering can occur (the best results will be close to each other).

Finally, by iterating through each sentence and finding its K nearest neighbours, you can find a set of up to 20 words which could be used. This will give a different result for each value of K (sentences which share the best 20 words).

I would probably use a database to support this, with select union and select intersection allowing set testing.

mksteve
0

NP-hard by reduction from CLIQUE (assuming that we replace 20 with a parameter). Given a graph in which we're looking for a k-clique, assign each vertex a unique word, make the two-word sentence corresponding to each edge, and try to select k choose 2 sentences that include each word k - 1 times.

Need to think about whether there's an algorithm with reasonable parameterized complexity.
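To make the reduction concrete, here is a small hypothetical illustration in Python (the graph and all names are made up): vertices become words, edges become two-word sentences, and a k-clique corresponds to k words covered by k*(k-1)/2 of those sentences, each chosen word appearing k - 1 times.

# vertices a, b, c, d; edges of the graph (a triangle a-b-c plus one extra edge)
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
sentences = ["%s %s" % (u, v) for u, v in edges]

k = 3
chosen = {"a", "b", "c"}                             # the 3-clique
kept = [s for s in sentences if set(s.split()) <= chosen]
assert len(kept) == k * (k - 1) // 2                 # each chosen word appears exactly k - 1 times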

David Eisenstat
  • Thanks for your answer. Would you mind adding an example for some simple sentences? – Baz Aug 25 '15 at 13:37
  • Can you kindly explain your answer in a little more detail as I am struggling to understand it. Presuming that I am using networkx, should I be adding edges to the graph, for each sentence, as follows: for word_pair in itertools.combinations(sentence,2): G.add_edge(word_pair[0],word_pair[1]) – Baz Aug 25 '15 at 20:53
  • Thanks again. What do you mean by "make the two-word sentence corresponding to each edge"? Do you mean that I should add all "two-word combinations" for each sentence to the graph? For example, the sentence "i am here" includes the "two word sentences": "i am", "am here", "i here". So the vertex "am", for example, will have two edges entering it in the form of "i am" and "am here". I also don't understand why cliques are relevant here. For example, it is possible that my solution set includes the words "he" and "she". However these words might never occur in the same sentence. – Baz Aug 26 '15 at 10:52
  • What direction does it go then? – Baz Aug 27 '15 at 12:26
  • @Baz: This is not really an attempt to provide a solution, but merely to show/prove that an exact and efficient general solution is extremely unlikely to exist. – Nuclearman Aug 28 '15 at 00:33
  • (Yes, downvoters, this could be a comment but for the reckless deletion policies for that type of communication.) – David Eisenstat Sep 01 '15 at 18:01