4

I have two lists, as shown below:

c = ['John', 'query 989877 forcast', 'Tamm']
isl = ['My name is Anne Query 989877', 'John', 'Tamm Ju']

I want to check every item in isl against every item in c so that I also get the partial string matches. The output I need looks like this:

out = ["john", "query 989877", "tamm"]

As can be seen I have gotten the partial string matches as well.

I have tried the below:

out = []
for word in c:
    for w in isl:
        if word.lower() in w.lower():
            out.append(word)

But this only gives me the output as

out = ["John", "Tamm"]

I have also tried the below:

print [word for word in c if word.lower() in (e.lower() for e in isl)]

But this outputs only "John". How do I get what I want?

user1452759

2 Answers

4

Perhaps something like this:

>>> def get_sub_strings(s):
...     words = s.split()
...     for i in xrange(1, len(words)+1):      # reverse the order here
...         for n in xrange(0, len(words)+1-i):
...             yield ' '.join(words[n:n+i])
...
>>> out = []
>>> for word in c:
...     for sub in get_sub_strings(word.lower()):
...         for s in isl:
...             if sub in s.lower():
...                 out.append(sub)
...
>>> out
['john', 'query', '989877', 'query 989877', 'tamm']

If you want to store only the longest match, generate the sub-strings in reverse order (longest first) and break as soon as a match is found in isl:

def get_sub_strings(s):
    words = s.split()
    for i in xrange(len(words), 0, -1):    # longest sub-strings first
        for n in xrange(0, len(words)+1-i):
            yield ' '.join(words[n:n+i])

out = []
for word in c:
    for sub in get_sub_strings(word.lower()):
        if any(sub in s.lower() for s in isl):
            out.append(sub)
            break

print out
#['john', 'query 989877', 'tamm']
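
On Python 3 the same longest-match idea might look like the sketch below. The names sub_phrases and longest_matches are made up for this illustration and do not appear in the code above; the matching is still a plain case-insensitive substring test, so a phrase also counts as found when it sits inside a longer word.

```python
def sub_phrases(text):
    """Yield contiguous runs of words from text, longest runs first."""
    words = text.split()
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            yield ' '.join(words[start:start + size])


def longest_matches(candidates, targets):
    """For each candidate, keep its longest sub-phrase found in any target."""
    lowered = [t.lower() for t in targets]
    result = []
    for candidate in candidates:
        for phrase in sub_phrases(candidate.lower()):
            if any(phrase in t for t in lowered):
                result.append(phrase)
                break  # longer phrases are yielded first, so stop at the first hit
    return result


c = ['John', 'query 989877 forcast', 'Tamm']
isl = ['My name is Anne Query 989877', 'John', 'Tamm Ju']
print(longest_matches(c, isl))
# ['john', 'query 989877', 'tamm']
```

A candidate with no match at all simply contributes nothing to the result, which keeps a later count over the list straightforward.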
Ashwini Chaudhary
  • This is actually pretty great! Is there any way to remove "query" and "989877" from the "out" list? Ideally they should not be in the output. I'm insisting on this because I need to count the elements of the "out" list later on, and leaving the output as you have shown would give a wrong count. – user1452759 Nov 28 '14 at 07:21
  • @user1452759 Check my second solution. – Ashwini Chaudhary Nov 28 '14 at 08:43
0

Alright I have come up with this! An extremely hacky way to do it; I don't like the method myself but it gives me my output:

Step1:
in: c1 = []
    for r in c:
        c1.append(r.split())
out: c1 = [['John'], ['query', '989877', 'forcast'], ['Tamm']]


Step2:
in: p = []
    for w in isl:
        for word in c1:
            for w1 in word:
                if w1.lower() in w.lower():
                    p.append(w1)
out: p = ['query', '989877', 'John', 'Tamm']


Step3:
in: out = []
    for word in c:
        t = []
        for i in p:
            if i in word:
                t.append(i)
        out.append(t)
out: out = [['John'], ['query', '989877'], ['Tamm']]

Step4:
in: out_final = []
    for i in out:
        out_final.append(" ".join(i))
out: out_final = ['John', 'query 989877', 'Tamm']
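
The four steps above could probably be folded into one pass. The following is only a rough Python 3 sketch, and the function name matched_words is invented here. It differs from the steps in one respect: a candidate with no matching word is skipped instead of producing an empty string.

```python
def matched_words(candidates, targets):
    """Join, per candidate, the words that occur (case-insensitively) in any target."""
    lowered = [t.lower() for t in targets]
    result = []
    for candidate in candidates:
        # Keep each word of the candidate that is a substring of some target.
        kept = [w for w in candidate.split()
                if any(w.lower() in t for t in lowered)]
        if kept:  # assumption: candidates without any match are left out
            result.append(' '.join(kept))
    return result


c = ['John', 'query 989877 forcast', 'Tamm']
isl = ['My name is Anne Query 989877', 'John', 'Tamm Ju']
print(matched_words(c, isl))
# ['John', 'query 989877', 'Tamm']
```

Because every word is tested on its own, words that are not adjacent in a target still get joined: a candidate such as 'query foo 989877' would also give 'query 989877'.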
user1452759