2

Let's say I have a list of keywords:

`keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]`

And a list of main terms:

`main_terms = ["terms","words","vocab","list"]`

UPDATED to more clearly state the problem:

The script I'm writing removes near-duplicates from a long list of keywords. I've managed to remove misspellings and slight variants (e.g. "hitsory terms", "history term").

My problem is that I'm looking for several terms in this list of keywords. Once I've found one of these terms in a keyword (e.g. "history terms"), every keyword that is identical except for a different term or combination of terms (e.g. "history vocab", "history words", "history words terms") should be considered a duplicate.

  • It is OK to have multiple terms in a keyword (e.g. "math terms words vocab") as long as no other keyword is identical apart from having fewer of the terms (e.g. "math terms words" or, ideally, a single term like "math vocab").
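Read literally, the rule above could be sketched as follows. This is an illustration, not code from the post: `dedupe_by_subject` is a hypothetical helper, and it assumes that the "subject" of a keyword is whatever words are left once the main terms are removed, and that ties go to the keyword seen first.

```python
# Sketch only: `dedupe_by_subject` is a hypothetical name, not from the post.
def dedupe_by_subject(keywords, main_terms):
    main_terms = set(main_terms)
    best = {}  # subject (tuple of non-main words) -> (main-term count, keyword)
    for keyword in keywords:
        words = keyword.split()
        # Assumption: the subject is every word that is not a main term.
        subject = tuple(w for w in words if w not in main_terms)
        term_count = len(words) - len(subject)
        # Keep the keyword with the fewest main terms; the first one wins ties.
        if subject not in best or term_count < best[subject][0]:
            best[subject] = (term_count, keyword)
    return [keyword for _, keyword in best.values()]


keywords = ["history terms", "history words", "history vocab",
            "history words terms", "history vocab words", "science list",
            "science terms vocab", "math terms words vocab"]
main_terms = ["terms", "words", "vocab", "list"]

deduped = dedupe_by_subject(keywords, main_terms)
print(deduped)
```

Only keywords from the original list are returned, and "math terms words vocab" survives because no shorter "math" keyword exists. The output order relies on dicts preserving insertion order (Python 3.7+).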
DukeSilver
  • http://stackoverflow.com/questions/3931541/python-check-if-all-of-the-following-items-is-in-a-list – drum Sep 16 '16 at 22:51
  • @drum - That question doesn't seem applicable. – TigerhawkT3 Sep 16 '16 at 22:56
  • Removing `1) any keywords that has more than one of the main_terms in it` with an output of `"math terms words vocab"` which contains three doesn't make sense to me. – TemporalWolf Sep 16 '16 at 23:00
  • This appears to be an [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem), as your output doesn't match the explanation. – TemporalWolf Sep 16 '16 at 23:18
  • @TemporalWolf, you're right, sorry! I'm new to this and was still trying to wrap my head around the problem. I've updated it to hopefully give a better picture of what I'm trying to do. – DukeSilver Sep 30 '16 at 15:33

3 Answers

1

Loop through the keywords and check each one against the main_terms:

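keywords = ["history terms",
            "history words",
            "history vocab",
            "history words terms",
            "history vocab words",
            "science list",
            "science terms vocab",
            "math terms words vocab"]
main_terms = {"terms","words","vocab","list"}
result = {}  # maps each subject word to the first keyword seen for it
for words in keywords:
    s = set(words.split())
    s_subject = s - main_terms  # the words that are not main terms
    # Take one subject word (this assumes a single-word subject), or an empty set if there is none.
    subject = s_subject and next(iter(s_subject))
    # Keep the keyword only if it contains a main term (set intersection) and its subject is new.
    if s & main_terms and subject and subject not in result:
        result[subject] = words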

Then turn the result values into a list (before Python 3.7, dict order is arbitrary, so the order of the list may differ):

>>> list(result.values())
['math terms words vocab', 'history terms', 'science list']
martineau
TigerhawkT3
  • I think this actually solves my problem! Thanks! Is there any way that I can rank one of the main_terms above the others? As it currently stands, if "history terms" comes before "history vocab", the latter is eliminated, but what if I preferred to keep all of the keywords that had "terms" in them over keywords that were identical but with a different main_term? – DukeSilver Sep 30 '16 at 21:43
  • @DukeSilver - If you change the `if` to `if s & main_terms and subject and subject not in result or 'terms' in s:`, that will save (in this example) `'science':'science terms vocab'` instead of `'science':'science list'`. Is that what you had in mind? – TigerhawkT3 Oct 01 '16 at 04:29
  • Not quite (I misworded my comment above, sorry!). I don't want to return things that weren't in the original keyword list. The problem is that it keeps the first main_term instance that it finds and eliminates the rest, but I'd love for it to 'rank' the main_terms. For instance, have it look for `history list`, but if that's not a keyword, have it look for `history vocab`. `history vocab` is in the keywords list, so it would keep that one and eliminate the others (`history terms`, `history words`, etc.). Does that make sense? – DukeSilver Oct 03 '16 at 15:03
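The ranking asked for in the last comment could be sketched like this. None of it comes from the thread: `dedupe_with_ranking` and `ranked_terms` are hypothetical names, and the sketch assumes that keywords with fewer main terms are preferred first, with the ranking deciding between keywords that have the same number of main terms.

```python
# Sketch only: `dedupe_with_ranking` and `ranked_terms` are hypothetical names.
def dedupe_with_ranking(keywords, ranked_terms):
    # Lower index means more preferred, e.g. "list" before "vocab".
    rank = {term: i for i, term in enumerate(ranked_terms)}
    best = {}  # subject -> (sort key, keyword)
    for keyword in keywords:
        words = keyword.split()
        terms = [w for w in words if w in rank]
        subject = tuple(w for w in words if w not in rank)
        # Assumption: fewer main terms wins first, then the best-ranked term.
        key = (len(terms), min((rank[t] for t in terms), default=len(rank)))
        if subject not in best or key < best[subject][0]:
            best[subject] = (key, keyword)
    return [keyword for _, keyword in best.values()]


keywords = ["history terms", "history words", "history vocab",
            "history words terms", "history vocab words", "science list",
            "science terms vocab", "math terms words vocab"]
ranked_terms = ["list", "vocab", "terms", "words"]  # most preferred first

ranked = dedupe_with_ranking(keywords, ranked_terms)
print(ranked)
```

With this ordering, "history list" is not in the input, so "history vocab" is kept for the history subject, and every returned keyword comes from the original list. Swapping the two elements of `key` would instead let the ranking take priority over the number of main terms.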
0

I'm sure there's a more elegant solution, but this seems to be the solution for which you're looking, at least for part 1):

>>> def remove_main_terms(keyword):
...     words = keyword.split()
...     count = 0  # main terms seen so far
...     to_keep = []
...     for word in words:
...         if word in main_terms:
...             count += 1
...             if count > 1:
...                 continue  # drop every main term after the first
...         to_keep.append(word)
...     return " ".join(to_keep)

>>> keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]

>>> main_terms = ["terms","words","vocab","list"]

>>> new_list = []
>>> for w in keywords:
...     new_list.append(remove_main_terms(w))

>>> new_list
['history terms', 'history words', 'history vocab', 'history words', 'history vocab', 'science list', 'science terms', 'math terms']
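The list above still contains repeats such as 'history words'. One possible follow-up step, shown here as a sketch rather than as part of the original answer, is to drop exact duplicates while keeping the first occurrence of each entry. It assumes Python 3.7+, where `dict` preserves insertion order.

```python
# The trimmed list produced by the session above.
new_list = ['history terms', 'history words', 'history vocab', 'history words',
            'history vocab', 'science list', 'science terms', 'math terms']

# dict.fromkeys keeps the first occurrence of each key, in order (Python 3.7+).
unique_keywords = list(dict.fromkeys(new_list))
print(unique_keywords)
```

Note that trimmed entries such as 'science terms' and 'math terms' do not appear in the original `keywords` list.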
blacksite
0

EDIT: I'm increasingly thinking you're asking an XY Question and you want unique subjects.

If that is the case, the following works even better:

result = []
for word in keywords:
    for term in main_terms:
        # Note: this is a substring replacement, not a whole-word match.
        word = word.replace(term, "")
    result.append(word.strip())

print(set(result))

Which outputs `{'science', 'math', 'history'}` (the order of a set may vary).


This solves your original problem with the same results, but does it by ignoring terms after the first and only passing unique first words.

result = []
seen = set()  # first words that already have a keyword in result
for word in keywords:
    first = word.split()[0]
    if first not in seen:
        seen.add(first)
        result.append(word)
print(result)

See the demo on repl.it

TemporalWolf