This is an old question, so you've probably already found some heuristics you like. I came across it while exploring ways to generate perfect pangrams, which by definition use the fewest possible characters (each letter of the alphabet appears exactly once). Anyway, for future finders like myself:
I wrote a program that has had some success. I treated this problem more like graph search than set cover and used A* as the starting point for the algorithm. You can explore the code on GitHub.
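To give a feel for the framing (this is a minimal, greedy best-first sketch of my own, not the actual code in the repo, and the function and variable names are made up for illustration): a state is the set of letters still unused, and expanding a state means picking a word whose letters are all still needed, which is the perfect-pangram rule.

```python
import heapq
from itertools import count

ALPHABET = frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def best_first_pangram_search(letter_sets, heuristic):
    """letter_sets: list of frozensets, one per compressed word.
    heuristic: scores a remaining-letter set; lower = more promising
    (negate any "higher is better" score before passing it in)."""
    tie = count()  # tiebreaker so the heap never has to compare states
    frontier = [(heuristic(ALPHABET), next(tie), [], ALPHABET)]
    seen = {ALPHABET}
    while frontier:
        _, _, chosen, remaining = heapq.heappop(frontier)
        if not remaining:                    # all 26 letters covered
            return chosen
        for letters in letter_sets:
            if letters <= remaining:         # word uses only still-needed letters
                nxt = remaining - letters
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(
                        frontier,
                        (heuristic(nxt), next(tie), chosen + [letters], nxt))
    return None
```

The heuristics described below are what get plugged in as the priority function.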
The things that helped the most were:
Compress the State Space
I took a dictionary and transformed each word into its sorted letter set, so that, for example, "BAD" and "DAB" are both stored as "ABD". The compressed dictionary I used took ~250,000 words down to ~31,000 unique letter combos, which is a massive win.
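A sketch of that compression step (my own naming and normalization, not the repo's):

```python
from collections import defaultdict

def compress_dictionary(words):
    """Group words by their sorted letter set, e.g. "BAD" and "DAB"
    both land under the key "ABD"."""
    by_letter_set = defaultdict(list)
    for word in words:
        key = "".join(sorted(set(word.upper())))
        by_letter_set[key].append(word)
    return by_letter_set

# compress_dictionary(["bad", "dab", "cab"])
# -> {"ABD": ["bad", "dab"], "ABC": ["cab"]}
```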
Heuristics
As mentioned elsewhere, this problem is NP-hard, so I turned to heuristics. The three I'm currently using are:
Vowel Ratio
When I examine the letters remaining after picking a word, I compute #vowels / #unusedLetters. The motivation is simple: having more vowels remaining makes it more likely that I'll be able to select words using those letters.
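As a rough illustration (my own sketch, not the repo's code):

```python
VOWELS = set("AEIOU")

def vowel_ratio(remaining_letters):
    """Fraction of the unused letters that are vowels; higher suggests it
    will be easier to keep forming words from what is left."""
    if not remaining_letters:
        return 0.0
    return sum(1 for c in remaining_letters if c in VOWELS) / len(remaining_letters)
```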
Letter Commonality
When I read in the initial word set, I build a dictionary keyed by letter and count the number of times each letter appears across all the words. I then use this dictionary to prefer nodes whose remaining letters are more common. (I believe the OP mentioned this one in the comments.)
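Sketched out (again, names are mine, and this assumes the compressed letter-set dictionary from above):

```python
from collections import Counter

def build_letter_counts(letter_sets):
    """Count how often each letter appears across the compressed word list."""
    counts = Counter()
    for letters in letter_sets:
        counts.update(letters)
    return counts

def commonality_score(remaining_letters, letter_counts):
    """Higher when the letters still unused are common across the dictionary,
    i.e. many words remain available to cover them."""
    return sum(letter_counts[c] for c in remaining_letters)
```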
Shared 3-Letter Combos
This is similar to the letter commonality heuristic. Again, when processing the initial word set, I create a dictionary containing every 3-letter combination that can be made from each letter set. For example, the letter set ABC has only one valid combo, but ABCD has [ABC, ABD, ACD, BCD]. Remember, I only care about sorted letter sets after compressing the initial word set.
In the end, much like the letter commonality measure, I have a dictionary mapping the 26-choose-3 possible letter triples to the number of times each appears across my word set. I then use this to prefer searching nodes whose remaining letters form more valid 3-letter combos.
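A possible sketch of this one (my own naming; not taken from the repo):

```python
from collections import Counter
from itertools import combinations

def build_trio_counts(letter_sets):
    """Count how often each sorted 3-letter combination appears across the
    compressed word list, e.g. "ABCD" contributes ABC, ABD, ACD and BCD."""
    counts = Counter()
    for letters in letter_sets:
        counts.update(combinations(sorted(letters), 3))
    return counts

def trio_score(remaining_letters, trio_counts):
    """Higher when the unused letters still form many 3-letter combos that
    actually occur somewhere in the dictionary."""
    return sum(trio_counts[trio]
               for trio in combinations(sorted(remaining_letters), 3))
```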