Thank you in advance for your help.
I have a list of strings:
full_name_list = ["hello all","cat for all","dog for all","cat dog","hello cat","cat hello"]
I need to compute a percent match between each element and every other element in the list. For example, I first need to break "hello all" down into ["hello", "all"], and I can see that "hello" is in "hello cat", so that would be a 50% match. Here is what I have so far:
hello all [u'hello', u'hello all', u'hello cat', u'cat hello'] [u'all', u'hello all', u'cat for all', u'dog for all']
cat for all [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello'] [u'for', u'cat for all', u'dog for all'] [u'all', u'hello all', u'cat for all', u'dog for all']
dog for all [u'dog', u'dog for all', u'cat dog'] [u'for', u'cat for all', u'dog for all'] [u'all', u'hello all', u'cat for all', u'dog for all']
cat dog [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello'] [u'dog', u'dog for all', u'cat dog']
hello cat [u'hello', u'hello all', u'hello cat', u'cat hello'] [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']
cat hello [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello'] [u'hello', u'hello all', u'hello cat', u'cat hello']
As you can see, the first word in each sublist is the substring being searched for, followed by the elements that contain that substring. I am able to do this for one-word matches, and I realized that I can continue this process by simply taking the intersection between the individual words' results to get two-word matches, e.g.
cat for all [(cat,for) [u'cat for all']] [(for,all) [u'cat for all', u'dog for all']]
The problem I'm having is doing this recursively, since I don't know how long my longest string is going to be. Also, is there a better way to do this string search? Ultimately I want to find the strings that match 100%, because realistically "hello cat" == "cat hello". I also want to find the 50% matches and so on.
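To make the goal concrete, here is a minimal sketch of the kind of percent match I mean. It assumes whole-word matching on whitespace-split words (which is stricter than the substring test in my code below) and that the percentage is taken relative to the number of distinct words in the string being searched. The name `percent_match` is just a hypothetical helper for illustration.

```python
def percent_match(source, target):
    """Percentage of the distinct words in `source` that also occur in `target`.

    Assumption: words are compared as whole, whitespace-separated tokens,
    so word order does not matter.
    """
    source_words = set(source.split())
    target_words = set(target.split())
    if not source_words:
        return 0.0
    return 100.0 * len(source_words & target_words) / len(source_words)


full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

# For every string, the percent match against every other string in the list.
matches = {
    name: {other: percent_match(name, other)
           for other in full_name_list if other != name}
    for name in full_name_list
}
```

With this definition, `matches["hello all"]["hello cat"]` is 50.0 and `matches["hello cat"]["cat hello"]` is 100.0. Note that the measure is not symmetric: "cat dog" against "cat for all" is 50%, while "cat for all" against "cat dog" is only one word out of three.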
An idea I was given was to use a binary tree, but how would I go about doing this in Python? Here is my code so far:
logical_list = []
logical_list_2 = []
logical_list_3 = []
logical_list_4 = []
match_1 = []
match_2 = []
i = 0
# logical_df is a pandas DataFrame defined elsewhere; 'Logical' holds the strings
logical_name_full = logical_df['Logical'].tolist()
# Build [full string, word1, word2, ...] for every string
for x in logical_name_full:
    logical_sublist = [x]+x.split()
    logical_list.append(logical_sublist)
for sublist in logical_list:
    logical_list_2.append(sublist[0])
    for split_words in sublist[1:]:
        # Each match list starts with the word being searched for
        match_1.append(split_words)
        for logical_names in logical_name_full:
            # Substring test: also matches inside longer words
            if split_words in logical_names:
                match_1.append(logical_names)
        logical_list_2.append(match_1)
        match_1 = []
    logical_list_3.append(logical_list_2)
    logical_list_2 = []
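For reference, the intersection idea can be extended to word groups of any length without explicit recursion. The following is only a sketch of one possible direction, assuming whole-word matching rather than the substring test above; `word_index` and `combo_matches` are hypothetical helper names, and `itertools.combinations` is used to enumerate the word groups.

```python
from itertools import combinations


def word_index(names):
    """Map each word to the set of strings that contain it as a whole word."""
    index = {}
    for name in names:
        for word in set(name.split()):
            index.setdefault(word, set()).add(name)
    return index


def combo_matches(name, index):
    """For every group of words taken from `name` (size 1 up to all of them),
    return the set of strings that contain all words in that group."""
    words = sorted(set(name.split()))
    result = {}
    for size in range(1, len(words) + 1):
        for combo in combinations(words, size):
            # Intersect the per-word sets, as in the two-word case above.
            result[combo] = set.intersection(*(index[w] for w in combo))
    return result


full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]
index = word_index(full_name_list)
cat_for_all = combo_matches("cat for all", index)
```

Here the keys are tuples of words in alphabetical order, so the ("for", "all") pair from my example appears as `("all", "for")`. The number of groups grows as 2^n - 1 for a string of n distinct words, so this is only practical when the strings are short.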