How to find strings from one list based on list of indexes from other list with some conditions in Python?

Question

I'm new to python and continuously learning to build better codes in python. I have two lists; one with indexes stored in x variable, where the indexes in the x represents index of tuples in list named bb with string ('IN') and surrounded on both sides at least with one tuple containing 'NN'.

What i'm trying to get from the below code is, From every index mentioned in x in bb, how many continuous strings starting with 'NN' are present on both sides of the string tuple in bb list.

I tried the below code, but the code isn't efficient enough. Anybody please help me in making the code efficient.

     bb = [('The', 'RB'),
     ('company', 'NN'),
     ('whose', 'NNS'),
     ('stock', 'IN'),
     ('has', 'NNP'),
     ('been', 'NNS'),
     ('on', 'NNP'),
     ('tear', 'VBJ'),
     ('this', 'VB'),
     ('week', 'NNS'),
     ('already', 'NN'),
     ('sells', 'IN'),
     ('its', 'NNP'),
     ('graphics', 'NNS'),
     ('processing', 'VB'),
     ('units', 'VBJ'),
     ('biggest', 'NNS'),
     ('cloud', 'NN'),
     ('companies', 'IN'),
     ('just', 'NNP'),
     ('that', 'IN')]

def solvr(bb):
    x = []
    for i in range(len(bb)-1):
        if bb[i][1] == 'IN':
            if 'NN' in (bb[i-1][1]) and 'NN' in (bb[i+1][1]):
                x.append(i)
    #===============================        

    for i in range(len(bb)-1):
        if i in x:
            k=[]
            front = bb[i+1:]
            v = 0-i
            back = bb[:-v]
    #======================

    for i in back:
        if 'NN' in i[1]:
            k.append(i[0])
            [[] for i in k] 
    #================================


    for i, j in enumerate(front):
        if front[i][1][:2] == 'NN':
            k.append(front[i][0])
        else:
            break
    return(k)

>> solvr(bb)

output:

['company',
 'whose',
 'has',
 'been',
 'on',
 'week',
 'already',
 'its',
 'graphics',
 'biggest',
 'cloud',
 'just']

My expectation from code is to get each iteration result in new list with also 'IN' string included in every list.

 [['company', 'whose', 'stock', 'has', 'been', 'on'],
 ['week', 'already', 'sells', 'its', 'graphics'],
 ['biggest', 'cloud', 'companies', 'just']]

Would be thankful if someone provides any changes to my code.

This is just a side observation but your part of speech tagging seems a bit off. — Aleksandar Savkov, May 12 '17 at 12:37
Yes Alex, this list of POS tags are made up ones to check the code, not real tags. — sharathchandramandadi, May 12 '17 at 13:15

score 3 · Accepted Answer · edited May 23 '17 at 10:31

This seems like a good problem for itertools.groupby which groups contiguous elements of a list together based on whether each element is true or not according to some condition that you define.

In your case you can use the following:

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 
result = [list(b) for a,b in groups if a]
result = [[w[0] for w in b] for b in result if 'IN' in [w[1] for w in b]]

print(result)

[['company', 'whose', 'stock', 'has', 'been', 'on'], 
 ['week', 'already', 'sells', 'its', 'graphics'], 
 ['biggest', 'cloud', 'companies', 'just', 'that']]

This works because groups splits your original bb list into sublists every time the condition (that the second element is 'IN' or starts with 'NN') goes from false to true (or vice versa). If we display groups you can see how it is split:

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 

print([(a,list(b)) for a,b in groups])

[(False, [('The', 'RB')]),
 (True,
  [('company', 'NN'),
   ('whose', 'NNS'),
   ('stock', 'IN'),
   ('has', 'NNP'),
   ('been', 'NNS'),
   ('on', 'NNP')]),
 (False, [('tear', 'VBJ'), ('this', 'VB')]),
 (True,
  [('week', 'NNS'),
   ('already', 'NN'),
   ('sells', 'IN'),
   ('its', 'NNP'),
   ('graphics', 'NNS')]),
 (False, [('processing', 'VB'), ('units', 'VBJ')]),
 (True,
  [('biggest', 'NNS'),
   ('cloud', 'NN'),
   ('companies', 'IN'),
   ('just', 'NNP'),
   ('that', 'IN')])]

The boolean value says whether the following list contains elements that satisfy or don't satisfy the condition. Now all you have to do is only keep the one's who's boolean values are true (the condition is satisfied) and then keep the sublists that contain 'IN' as one of the part of speech tags.

And just for fun, if you want the whole solution as a (almost unreadably long) one-liner, you can use:

[[w[0] for w in b] for b in [list(b) for a,b in itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN'])  if a] if 'IN' in [w[1] for w in b]]

EDIT

In order to only keep the sublists that contain an 'IN' word with at least one 'NN' word on either side you can do the following:

Start with the same initial groups and results variables as before:

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 
result = [list(b) for a,b in groups if a]

Apply the same groupby function to the sublists but this time set the condition to having a part of speech equal to 'IN':

result = [[(a,list(b)) for a,b in itertools.groupby(r, lambda x: x[1] == 'IN')] for r in result]

Now iterate through result and remove any elements in the sublists who's boolean value from the groupby is true (POS is 'IN') and it is at the right or left edge of the sublist (index is 0 or -1)

result = [[b for i,(a,b) in enumerate(r) if (a and i not in [0,len(r)-1]) or not a] for r in result]

Now that we have removed these we can join everything together and take out the POS tags to get the correct output format (for details on the list flattening syntax see here)

result = [[w[0] for sub in r for w in sub] for r in result]

print(result)

[['company', 'whose', 'stock', 'has', 'been', 'on'],
 ['week', 'already', 'sells', 'its', 'graphics'],
 ['biggest', 'cloud', 'companies', 'just']]

Thanks @bunji for the code. But the following example gives the output as: [('Closure', 'NN'), ('for', 'IN'), ('the', 'DT'), ('same', 'JJ'), ('issue', 'NN'), ('in', 'IN'), ('INC0300660', 'NNP')] output: [['Closure', 'for'], ['issue', 'in', 'INC0300660']] The condition should be 'IN' surrounded by atleast one 'NN' on both sides. Thanks — sharathchandramandadi, May 12 '17 at 14:05
@sharathreddymandadi my mistake, I didn't realize that you needed at least one `'NN'` on either side. I will edit my answer to cover this. — bunji, May 12 '17 at 14:30

gushitong · Answer 2 · 2017-05-12T14:14:22.320

1

three line version(inner func):

def solve(bb):
    def _solve(lst):
        return False if not len(lst) else _solve(lst[1:]) if "NN" in lst[0][1] else "IN" in lst[0][1]
    return [bb[i][0] for i in range(len(bb)) if "NN" in bb[i][1] and (_solve(bb[0:i][::-1]) or _solve(bb[i:-1]))]

two line version(lambda):

def solve(bb):
    s = lambda lst: False if not len(lst) else s(lst[1:]) if "NN" in lst[0][1] else "IN" in lst[0][1]
    return [bb[i][0] for i in range(len(bb)) if "NN" in bb[i][1] and (s(bb[0:i][::-1]) or s(bb[i:-1]))]

edited May 12 '17 at 14:14

answered May 12 '17 at 13:17

gushitong

1,898
16
24

Thanks for the code @fushitong. But i'm getting IndexError: list index out of range, when tried with other examples like below. [('Closure', 'NN'), ('for', 'IN'), ('the', 'DT'), ('same', 'JJ'), ('issue', 'NN'), ('in', 'IN'), ('INC0300660', 'NNP')] I want to only identify 'IN' tagged strings surrounded by at least one 'NN' on both sides. – sharathchandramandadi May 12 '17 at 13:55

score 1 · Answer 3 · answered May 12 '17 at 14:15

I'm not sure how this fits your requirements as it may yield the same nouns as part of two different hits, for example [N*, N*, IN, N*, N*, IN, N*] --> [[N*, N*, IN, N*, N*], [N*, N*, IN, N*]]. If this is undesired then you will have a different approach. Here you simply keep a back buffer and check for all words if they cover the minimal requirements (N*, IN, N*). If they do then simply construct the full hit. I also use a generator as this is likely to be run on a large amount of data.

def solvr(bb):

    # keep a buffer of the previous tags
    back_buffer = []

    for i in range(len(bb)-1):

        word, tag = bb[i]
        _, next_tag = bb[i+1]

        # make sure there is a minimal hit of 3 tokens
        if tag == 'IN' and next_tag.startswith('N') and len(back_buffer) > 0:
            hit = back_buffer + [word]
            for it in bb[i+1:]:
                if it[1].startswith('N'):
                    hit.append(it[0])
                else:
                    break
            yield hit

        # add to the buffer
        if tag.startswith('N'):
            back_buffer.append(word)

        # reset the buffer as the sequence of N* tags has ended
        else:
            back_buffer = []
print(list(solvr(bb)))

alvas · Answer 4 · 2017-05-12T16:17:42.263

Try this "magic":

>>> bb = [('The', 'RB'), ('company', 'NN'), ('whose', 'NNS'), ('stock', 'IN'), ('has', 'NNP'), ('been', 'NNS'), ('on', 'NNP'), ('tear', 'VBJ'), ('this', 'VB'), ('week', 'NNS'), ('already', 'NN'), ('sells', 'IN'), ('its', 'NNP'), ('graphics', 'NNS'), ('processing', 'VB'), ('units', 'VBJ'), ('biggest', 'NNS'), ('cloud', 'NN'), ('companies', 'IN'), ('just', 'NNP'), ('that', 'IN')]

>>> filter(None, map(str.strip, ' '.join([word if pos.startswith('NN') or pos == 'IN'else '|' for word, pos in bb]).split('|')))
['company whose stock has been on', 'week already sells its graphics', 'biggest cloud companies just that']

Essentially, without crazy nested madness:

tmp = []
answer = []
for word, pos in bb:
    if pos.startswith('NN') or pos == 'IN':
        tmp.append(word)
    else:
        if tmp:
            answer.append(' '.join(tmp))
            tmp = []

if tmp: # Remeber to flush out the last tmp.
    answer.append(' '.join(tmp))

You only iterate through the bb once. This is similar to @bunji's answer with itertools.groupby

How to find strings from one list based on list of indexes from other list with some conditions in Python?

4 Answers4