Checking if a word is a subset of another word with same amount of letters

Question

I am making a word game program where I get a list of ~80,000 words from a text file, then use those words as a lexicon of words to choose from. The user requests a word of a certain length which is then given to them scrambled. They then guess words that are of the same length or less and that use the same letters in the same amount or less. I have this list comprehension in order to get all the words from the lexicon that are subsets of the scrambled word and are also in the lexicon. However it allows more occurrences of letters than appear in the original word. For example: If the scrambled word was 'minute', then 'in' should be a correct answer but 'inn' should not. The way I have it written now allows that though. Here is the list comprehension:

correct_answers = [
    word for word in word_list
    if set(word).issubset(random_length_word)
    and word in word_list
    and len(word) <= len(random_length_word)]

So I'm looking for something like issubset but that only allows the same number of letters or less. Hopefully that makes sense. Thanks in advance.

I feel like you have understood what you want to do very well. If I were you, I won't be doing this in a list comprehension way since you have two big logics in there. You can create a function to check a single word, and then apply it to all the words you have. Or you can create a smaller helper function specifically for checking the frequency, then use that within your list comprehension. — TYZ, Feb 12 '21 at 17:01

wjandrea · Accepted Answer · 2021-02-13T02:56:07.610

I wrote a function that does this for playing the Countdown letters game. I called the desired input a "subset-anagram", but there's probably a better technical term for it.

Essentially, what you're looking for is a multiset (from word) that is a subset of another multiset (from random_length_word). You can do this with collections.Counter, but I actually found it much faster to do it a different way: make a list out of random_length_word, then remove each character of word. It's probably faster due to the overhead of creating new Counter objects.

def is_subset_anagram(str1, str2):
    """
    Check if str1 is a subset-anagram of str2.

    Return true if str2 contains at least as many of each char as str1.

    >>> is_subset_anagram('bottle', 'belott')  # Just enough
    True
    >>> is_subset_anagram('bottle', 'belot')  # less
    False
    >>> is_subset_anagram('bottle', 'bbeelloott')  # More
    True
    """
    list2 = list(str2)
    try:
        for char in str1:
            list2.remove(char)
    except ValueError:
        return False
    return True

>>> [w for w in ['in', 'inn', 'minute'] if is_subset_anagram(w, 'minute')]
['in', 'minute']

For what it's worth, here's the Counter implementation:

from collections import Counter

def is_subset_anagram(str1, str2):
    delta = Counter(str1) - Counter(str2)
    return not delta

This works because Counter.__sub__() produces a multiset, that is, counts less than 1 are removed.

I have written a `Counter` implementation (see my [answer](https://stackoverflow.com/a/66176061/3566606) below), but your approach is quicker, since it does not require to count all characters, but in most cases it can return `False` much earlier. +1 — Jonathan Scholbach, Feb 12 '21 at 17:29
@jonathan Ah yeah, I didn't even think about short-circuiting! Thanks! — wjandrea, Feb 12 '21 at 17:32
This is great, thanks, i forgot to mention in the question that im not allowed to use any modules so counter isn't a possibility for me. Thank you! — Justin Phelps, Feb 12 '21 at 19:05
@jonathan I added my `Counter` implementation, if you want to take a look — wjandrea, Feb 13 '21 at 02:58

Jonathan Scholbach · Answer 2 · 2021-02-12T17:23:55.073

Your approach loses the information, how often a certain character appears, because set(answer) does not contain this information any more.

Anyway, I think you are over-complicating things with your approach. There is a more efficient way for checking whether an answer is correct, instead of creating a list of all possible answers:

We could just check whether the answer has the matching character frequencies with any of the words in word_list. More specifically, "matching character frequencies" means that all characters appear less (or equally) often in the answer and the candidate from word list.

Getting the character frequencies of a string is a classic job collections.Counter has been invented for.
Checking that the character frequencies match means that all characters in the word have less or equal count in the answer.
Finally, checking that an answer is correct means that this condition is true for any of the words in word_list.

from collections import Counter
from typing import List


def correct_answer(word_list: List[str], answer: str) -> bool:
    return any(
        all(  
        # this checks if each char occurs less often in the word
            Counter(answer)[character] <= Counter(word)[character]
            for character in Counter(answer).keys()
        )
        for word in word_list
    )

This is more efficient than your approach, because it takes way less memory space. Thanks to any and all being short-circuit, it is also quite time-efficient.

Checking if a word is a subset of another word with same amount of letters

2 Answers2

Linked