
Thank you in advance for your help.

I have a list of strings

full_name_list = ["hello all","cat for all","dog for all","cat dog","hello cat","cat hello"]

I need to compute a percent match between each element and all the other elements in the list. For example, I first break "hello all" into ["hello", "all"]; "hello" also appears in "hello cat", so that would be a 50% match. Here is what I have so far:

    hello all   [u'hello', u'hello all', u'hello cat', u'cat hello'] [u'all', u'hello all', u'cat for all', u'dog for all'] 
    cat for all [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    [u'for', u'cat for all', u'dog for all']    [u'all', u'hello all', u'cat for all', u'dog for all']
    dog for all [u'dog', u'dog for all', u'cat dog']    [u'for', u'cat for all', u'dog for all']    [u'all', u'hello all', u'cat for all', u'dog for all']
    cat dog     [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    [u'dog', u'dog for all', u'cat dog']    
    hello cat   [u'hello', u'hello all', u'hello cat', u'cat hello']    [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    
    cat hello   [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    [u'hello', u'hello all', u'hello cat', u'cat hello']    

As you can see, the first item in each sublist is the word being searched for, followed by the elements that contain it. I can do this for single-word matches, and I realized that I can continue the process by taking the intersection of the single-word results to get two-word matches, e.g.

    cat for all [(cat,for)  [u'cat for all']]   [(for,all)  [u'cat for all', u'dog for all']]

The problem I'm having is doing this recursively, since I don't know how long my longest string is going to be. Also, is there a better way to do this string search? Ultimately I want to find the strings that match 100%, because realistically "hello cat" == "cat hello". I also want to find the 50% matches and so on.

An idea I was given was to use a binary tree, but how can I go about doing this in Python? Here is my code so far:

# logical_df (not shown here) has a 'Logical' column holding the full names.
logical_name_full = logical_df['Logical'].tolist()

# Each sublist is [full name, word 1, word 2, ...].
logical_list = []
for x in logical_name_full:
    logical_list.append([x] + x.split())

# For each full name, build [full name, [word, names containing word, ...], ...].
logical_list_3 = []
for sublist in logical_list:
    logical_list_2 = [sublist[0]]
    for split_words in sublist[1:]:
        match_1 = [split_words]
        for logical_names in logical_name_full:
            # Substring test: "cat" would also match inside a longer word.
            if split_words in logical_names:
                match_1.append(logical_names)
        logical_list_2.append(match_1)
    logical_list_3.append(logical_list_2)
Jonathan Hall

3 Answers


If I understand correctly, you have a list of strings and want the percent match of a word in each string, where the percentage is the number of times the word occurs in the string divided by the string's total number of words. If so, this example should be enough:

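word = "hello"  # the single word to look for (example value)

for i in full_name_list:
    words = i.split(" ")
    if word in words:
        total_words = len(words)
        # Count how many words of this string equal the search word.
        match_words = words.count(word)
        # Multiply by 100.0 so Python 2 does not truncate the division to 0.
        print(i + " Word match: " + str(100.0 * match_words / total_words) + "%")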

For matching multi-word strings, where the order of words in the matched string is not important:

word = "test string"
full_name_list = ["test something", "test string something", "test string", "string test", "string something test"]
results = []

for i in full_name_list:
    # Only consider strings that share at least one whole word with the query.
    if any(single_word in i.split(" ") for single_word in word.split(" ")):
        total_words = len(i.split(" "))
        match_words = 0.0  # float, so the division below is not truncated in Python 2
        for single_word in word.split(" "):
            for w in i.split(" "):
                if single_word == w:
                    match_words += 1
        results.append(i + "," + str((match_words/total_words)*100) + "%")

# Write one "string,percent" line per result so the file can be opened in Excel.
with open("file.csv", "w") as f:
    for i in results:
        f.write(i+"\n")
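
To compare every name against every other name rather than against a single `word`, the same counting can be wrapped in a function and called for each pair. The following is only a sketch under assumptions: the helper names `percent_match` and `match_table` are hypothetical, words are split on whitespace, and the percentage is taken relative to the number of words in the candidate string, as in the loop above. It targets Python 3 and uses the standard `csv` module so the rows can be opened in Excel.

```python
import csv
import io

full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

def percent_match(query, candidate):
    """Percentage of the candidate's words that also occur in the query."""
    query_words = set(query.split())
    candidate_words = candidate.split()
    hits = sum(1 for w in candidate_words if w in query_words)
    return 100.0 * hits / len(candidate_words)

def match_table(names):
    """One row per (query, candidate) pair with a non-zero match."""
    rows = []
    for query in names:
        for candidate in names:
            if query == candidate:
                continue  # skip comparing a name with itself
            pct = percent_match(query, candidate)
            if pct > 0:
                rows.append((query, candidate, round(pct, 1)))
    return rows

# Written to an in-memory buffer here; for a real file, use
# open("file.csv", "w", newline="") in place of io.StringIO().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["query", "candidate", "percent"])
writer.writerows(match_table(full_name_list))
print(buffer.getvalue())
```

Because the words are compared as sets, word order is ignored, so "hello cat" against "cat hello" comes out as a 100% match.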
Jan Novák
  • This is cool, I just need to figure out how to use each word in my list in place of the variable "word" and then print this all out to excel so that I can see a visualization of all the percent matches. Thank you for giving me this other view of the idea because instead of going through each component of the full words and matching them one by one to other full words, I could just see how each full word relates to other full words. – Edward Mordechay Sep 21 '17 at 17:13
  • @EdwardMordechay You could just take your list and join it into the word variable with " ".join(...). – Jan Novák Sep 21 '17 at 17:39
  • @EdwardMordechay And for visualising in Excel, you could just use CSV, in which case your results.append line would look like this: results.append(i + "," + str((match_words/total_words)*100) + "%") You'd need some file IO afterwards. – Jan Novák Sep 21 '17 at 17:43

I think I know what you're asking for (if not, just comment on my answer, I'll try to help). I wrote a small program that does what I think you're asking for:

full_name_list = ["hello all","cat for all","dog for all","cat dog","hello cat","cat hello"]

# Replace each full name with its list of words.
for i in range(len(full_name_list)):
    full_name_list[i] = full_name_list[i].split(' ')

def match(i, j):
    # The word to look for: word j of full name i.
    word = full_name_list[i][j]

    for index, fullname in enumerate(full_name_list):
        # Skip the full name that the word came from.
        if index == i: continue

        for name in fullname:
            if word == name:
                # One matching word out of len(fullname) words.
                return '"{}" is a {}% match to "{}"'.format(name, 100 // len(fullname), ' '.join(fullname))

print(match(0,1))

You pass in two parameters: i, the index of the name in the list, and j, the index of the word within that name. The function returns a string naming the full name that the word was matched to and how well it matches, and it avoids matching the word against its own name. I ran the function once at the bottom: match(0,1) looks up the word "all" from "hello all" and returns its first match, "cat for all", at 33%.

Again, please tell me if I didn't answer it well. It only returns the first match it finds, but it can be easily modified to return all of them.
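
Returning every match instead of only the first could look like the following sketch. The name `all_matches` is hypothetical, and unlike the code above it takes the list as a parameter and leaves `full_name_list` unsplit. The percentage is still one word out of the matched name's word count.

```python
full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

def all_matches(names, i, j):
    """Collect every other name that contains word j of names[i]."""
    word = names[i].split()[j]
    results = []
    for index, fullname in enumerate(names):
        if index == i:
            continue  # do not match the name against itself
        words = fullname.split()
        if word in words:
            # One matching word out of len(words) words in the matched name.
            results.append((word, 100 // len(words), fullname))
    return results

for word, percent, fullname in all_matches(full_name_list, 0, 1):
    print('"{}" is a {}% match to "{}"'.format(word, percent, fullname))
```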

paper man
  • This is exactly what I need, but I need a few extra parts, 1. I have to go through each component in full_name_list to get the percent matches 2. If the string has multiple words I need it to find the individual word matches and the multiple-word matches, e.g. in the example we search for "all" but I also need to search for "hello" and "hello all" 3. I need a return of all the words that had at least one match – Edward Mordechay Sep 21 '17 at 17:04

I made the changes that you asked for. Just so you know, I used a subset function that I got from here, and it imports from itertools (which is part of Python's standard library). If this is an issue, let me know.

Here is the new code. I ran it at the bottom just so you can see what it's like in action. You input an index i into the matches function, where i is the index of the name in full_name_list. I believe that it's everything you asked for.

from itertools import chain, combinations

full_name_list = ["hello all","cat for all","dog for all","cat dog","hello cat","cat hello"]

for i in range(len(full_name_list)):
    full_name_list[i] = full_name_list[i].split(' ')

def powerset(iterable):
    s = list(iterable)
    return list(chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1)))


def subset(string, container):
    # True if the tuple of words appears, in the same order, among the combinations of container.
    return string in powerset(container)

def makestring(names):
    # Join a sequence of words back into a single space-separated string.
    return ' '.join(names)

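def matches(i):
    results = []

    # Every non-empty combination of words from the name being searched.
    fullnamePS = powerset(full_name_list[i])

    for index, fullname in enumerate(full_name_list):
        # Skip the name being searched; comparing positions also copes with duplicate names.
        if index == i: continue

        for names in fullnamePS:
            if subset(names, fullname):
                # Percentage = matched words / words in the candidate name.
                results.append((int(100 * len(names)/len(fullname)), makestring(names), makestring(fullname)))

    return results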

for result in matches(1):
    print('"{}" is a {}% match to "{}"'.format(result[1],result[0],result[2]))
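
Because `combinations` keeps words in their original order, the powerset comparison treats "hello cat" and "cat hello" as different word pairs, so it never reports them as a 100% match. If word order should not matter, one alternative (not the approach above) is to compare sets of words directly, which also avoids generating every combination. A minimal sketch, assuming whitespace-separated words; the names `set_matches` and `names` are hypothetical:

```python
full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

def set_matches(names, i):
    """Compare names[i] with every other name, ignoring word order."""
    source = set(names[i].split())
    results = []
    for index, other in enumerate(names):
        if index == i:
            continue  # skip the name being searched
        other_words = set(other.split())
        shared = source & other_words
        if shared:
            # Share of the other name's distinct words found in the source name.
            percent = 100 * len(shared) // len(other_words)
            results.append((percent, sorted(shared), other))
    return results

for percent, shared, other in set_matches(full_name_list, 4):
    print('{} share {}% of "{}"'.format(shared, percent, other))
```

Each result lists the shared words once, so a two-word overlap is reported as a single entry rather than as separate one-word and two-word entries.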
paper man