How to check if multiple items from a list appear in a string?

Question

Let's say I have a list of keywords:

keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]

And a list of main terms:

`main_terms = ["terms","words","vocab","list"]`

UPDATED to more clearly state the problem:

The script I'm making removes near-duplicates from a long list of keywords. I've managed to remove misspellings and slight variants (e.g. "hitsory terms", "history term").

My problem is that I'm looking for several main terms in this list of keywords. Once I've found one of these terms in a keyword (e.g. "history terms"), every keyword that is identical except for a different term or combination of terms (e.g. "history vocab", "history words", "history words terms") should be considered a duplicate.

It is OK to have multiple terms in a keyword (e.g. "math terms words vocab") as long as there is no keyword that is identical save for having fewer of the terms (e.g. "math terms words" or, ideally, a single term like "math vocab").

http://stackoverflow.com/questions/3931541/python-check-if-all-of-the-following-items-is-in-a-list — drum, Sep 16 '16 at 22:51
Removing `1) any keywords that has more than one of the main_terms in it` with an output of `"math terms words vocab"` which contains three doesn't make sense to me. — TemporalWolf, Sep 16 '16 at 23:00
This appears to be an [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem), as your output doesn't match the explanation. — TemporalWolf, Sep 16 '16 at 23:18
@TemporalWolf, you're right, sorry! I'm new to this and was still trying to wrap my head around the problem. I've updated it to hopefully give a better picture of what I'm trying to do. — DukeSilver, Sep 30 '16 at 15:33

score 1 · Accepted Answer · edited Sep 17 '16 at 01:18

Loop through the keywords and check each one against the main_terms:

keywords = ["history terms",
            "history words",
            "history vocab",
            "history words terms",
            "history vocab words",
            "science list",
            "science terms vocab",
            "math terms words vocab"]
main_terms = {"terms","words","vocab","list"}
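result = {}  # maps each subject to the first keyword seen for it
for words in keywords:
    s = set(words.split())
    s_subject = s - main_terms  # words that are not main terms
    subject = s_subject and next(iter(s_subject))
    # Keep the keyword only if it contains a main term and its subject is new.
    if s & main_terms and subject and subject not in result:
        result[subject] = words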

Then turn the result values into a list:

>>> list(result.values())
['math terms words vocab', 'history terms', 'science list']
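
For anyone less familiar with set operators, here is a small sketch of the two operations the loop leans on: `-` gives the words that are not main terms, and `&` gives the main terms a keyword actually contains. The sample keyword is only an illustration.

```python
main_terms = {"terms", "words", "vocab", "list"}

s = set("history words terms".split())

# Difference: words of the keyword that are not main terms (the subject).
subject_words = s - main_terms
print(subject_words)          # {'history'}

# Intersection: main terms that actually occur in the keyword.
terms_found = s & main_terms
print(sorted(terms_found))    # ['terms', 'words']

# A keyword without any main term gives an empty (falsy) intersection.
no_terms = set("history".split()) & main_terms
print(bool(no_terms))         # False
```

An empty set is falsy, which is why both results can be used directly in an `if` condition.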

edited Sep 17 '16 at 01:18 · martineau
answered Sep 16 '16 at 23:11 · TigerhawkT3

I think this actually solves my problem! Thanks! Is there any way that I can rank one of the main_terms above the others? As it currently stands, if "history terms" comes before "history vocab", the latter is eliminated, but what if I preferred to keep all of the keywords that had "terms" in them over keywords that were identical but with a different main_term? – DukeSilver Sep 30 '16 at 21:43
@DukeSilver - If you change the `if` to `if s & main_terms and subject and subject not in result or 'terms' in s:`, that will save (in this example) `'science':'science terms vocab'` instead of `'science':'science list'`. Is that what you had in mind? – TigerhawkT3 Oct 01 '16 at 04:29
Not quite (I misworded my comment above, sorry!). I don't want to return things that weren't in the original keyword list. The problem is that it keeps the first main_term instance that it finds and eliminates the rest, but I'd love for it to 'rank' the main_terms. For instance, have it look for `history list`, but if that's not a keyword, have it look for `history vocab`. `history vocab` is in the keywords list, so it would keep that one and eliminate the others (`history terms`, `history words`, etc.). Does that make sense? – DukeSilver Oct 03 '16 at 15:03
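
The ranking asked about in the last comment was not answered in the thread. One possible sketch follows. It rests on assumptions that are not from the original posts: a hypothetical `ranked_terms` list ordered from most to least preferred, a hypothetical helper name `dedupe_ranked`, and the rule that a keyword with fewer main terms beats one with more before rank is compared. Only keywords from the original list are ever returned.

```python
def dedupe_ranked(keywords, ranked_terms):
    """Keep one keyword per subject, preferring fewer, then better-ranked, terms."""
    # Lower number means more preferred (hypothetical ranking scheme).
    rank = {term: i for i, term in enumerate(ranked_terms)}
    best = {}  # subject -> (sort key, keyword)
    for keyword in keywords:
        words = keyword.split()
        terms = [w for w in words if w in rank]
        subject = " ".join(w for w in words if w not in rank)
        if not terms or not subject:
            continue  # assumption: skip keywords lacking a term or a subject
        # Compare by number of terms first, then by the ranks of those terms.
        key = (len(terms), sorted(rank[t] for t in terms))
        if subject not in best or key < best[subject][0]:
            best[subject] = (key, keyword)
    return [keyword for _, keyword in best.values()]


keywords = ["history terms", "history words", "history vocab",
            "history words terms", "history vocab words", "science list",
            "science terms vocab", "math terms words vocab"]
ranked_terms = ["list", "vocab", "terms", "words"]  # hypothetical priority order

print(dedupe_ranked(keywords, ranked_terms))
# ['history vocab', 'science list', 'math terms words vocab']
```

There is no `history list` in the input, so `history vocab` wins for that subject, as in the example from the comment. The output order relies on dictionaries preserving insertion order (Python 3.7+).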

blacksite · Answer 2 · 2016-09-16T23:15:11.017

I'm sure there's a more elegant solution, but this seems to be the solution for which you're looking, at least for part 1):

>>> def remove_main_terms(keyword):
        # Keep words up to, but not including, the second main term.
        words = keyword.split()
        count = 0
        to_keep = []
        for word in words:
            if word in main_terms:
                count += 1
            if count < 2:
                to_keep.append(word)
        return " ".join(to_keep)

>>> keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]

>>> main_terms = ["terms","words","vocab","list"]

>>> new_list = []
>>> for w in keywords:
        new_list.append(remove_main_terms(w))

>>> new_list
['history terms', 'history words', 'history vocab', 'history words', 'history vocab', 'science list', 'science terms', 'math terms']
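
Note that `new_list` above still contains repeats ('history words' and 'history vocab' each appear twice). If unique entries are wanted, which is an assumption about the goal and not part of this answer, one option is `dict.fromkeys`, which keeps the first occurrence of each item in order on Python 3.7+:

```python
new_list = ['history terms', 'history words', 'history vocab',
            'history words', 'history vocab', 'science list',
            'science terms', 'math terms']

# dict keys are unique and keep insertion order, so repeats are dropped.
unique = list(dict.fromkeys(new_list))
print(unique)
# ['history terms', 'history words', 'history vocab', 'science list', 'science terms', 'math terms']
```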

score 0 · Answer 3 · edited Mar 20 '17 at 10:29

EDIT: I'm increasingly thinking you're asking an XY Question and you want unique subjects.

If that is the case, the following works even better:

result = []
for word in keywords:
    # Strip every main term from the keyword, leaving only the subject.
    for term in main_terms:
        if term in word:
            word = word.replace(term, "")
    result.append(word.strip())

print(set(result))

Which outputs {'science', 'math', 'history'} on Python 3 (element order may vary).
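
One caveat with `str.replace` is that it also removes a term found inside a longer word, so a keyword like "playlist terms" (a made-up example, not from the question) would be reduced to "play". A word-level sketch that avoids this, using a hypothetical helper `unique_subjects` and assuming terms should only match whole words:

```python
def unique_subjects(keywords, main_terms):
    """Return the set of subjects left after dropping whole-word main terms."""
    terms = set(main_terms)
    found = set()
    for keyword in keywords:
        # Compare word by word, so "list" does not match inside "playlist".
        subject = " ".join(w for w in keyword.split() if w not in terms)
        if subject:
            found.add(subject)
    return found


keywords = ["history terms", "history words", "history vocab",
            "history words terms", "history vocab words", "science list",
            "science terms vocab", "math terms words vocab"]
main_terms = ["terms", "words", "vocab", "list"]

print(sorted(unique_subjects(keywords, main_terms)))
# ['history', 'math', 'science']
```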

The following solves your original problem with the same results, but does so by keeping only the first keyword seen for each distinct first word and ignoring the terms that follow it.

result = []
for word in keywords:
    found = False
    for res in result:
        # Compare first words exactly; a substring test could match unrelated keywords.
        if word.split()[0] == res.split()[0]:
            found = True
    if not found:
        result.append(word)
print(result)

See the demo on repl.it
