4

I'm new to python and continuously learning to build better codes in python. I have two lists; one with indexes stored in x variable, where the indexes in the x represents index of tuples in list named bb with string ('IN') and surrounded on both sides at least with one tuple containing 'NN'.

What i'm trying to get from the below code is, From every index mentioned in x in bb, how many continuous strings starting with 'NN' are present on both sides of the string tuple in bb list.

I tried the below code, but the code isn't efficient enough. Anybody please help me in making the code efficient.

     bb = [('The', 'RB'),
     ('company', 'NN'),
     ('whose', 'NNS'),
     ('stock', 'IN'),
     ('has', 'NNP'),
     ('been', 'NNS'),
     ('on', 'NNP'),
     ('tear', 'VBJ'),
     ('this', 'VB'),
     ('week', 'NNS'),
     ('already', 'NN'),
     ('sells', 'IN'),
     ('its', 'NNP'),
     ('graphics', 'NNS'),
     ('processing', 'VB'),
     ('units', 'VBJ'),
     ('biggest', 'NNS'),
     ('cloud', 'NN'),
     ('companies', 'IN'),
     ('just', 'NNP'),
     ('that', 'IN')]

def solvr(bb):
    x = []
    for i in range(len(bb)-1):
        if bb[i][1] == 'IN':
            if 'NN' in (bb[i-1][1]) and 'NN' in (bb[i+1][1]):
                x.append(i)
    #===============================        

    for i in range(len(bb)-1):
        if i in x:
            k=[]
            front = bb[i+1:]
            v = 0-i
            back = bb[:-v]
    #======================

    for i in back:
        if 'NN' in i[1]:
            k.append(i[0])
            [[] for i in k] 
    #================================


    for i, j in enumerate(front):
        if front[i][1][:2] == 'NN':
            k.append(front[i][0])
        else:
            break
    return(k)

>> solvr(bb)

output:

['company',
 'whose',
 'has',
 'been',
 'on',
 'week',
 'already',
 'its',
 'graphics',
 'biggest',
 'cloud',
 'just']

My expectation from code is to get each iteration result in new list with also 'IN' string included in every list.

 [['company', 'whose', 'stock', 'has', 'been', 'on'],
 ['week', 'already', 'sells', 'its', 'graphics'],
 ['biggest', 'cloud', 'companies', 'just']]

Would be thankful if someone provides any changes to my code.

4 Answers4

3

This seems like a good problem for itertools.groupby which groups contiguous elements of a list together based on whether each element is true or not according to some condition that you define.

In your case you can use the following:

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 
result = [list(b) for a,b in groups if a]
result = [[w[0] for w in b] for b in result if 'IN' in [w[1] for w in b]]

print(result)

[['company', 'whose', 'stock', 'has', 'been', 'on'], 
 ['week', 'already', 'sells', 'its', 'graphics'], 
 ['biggest', 'cloud', 'companies', 'just', 'that']]

This works because groups splits your original bb list into sublists every time the condition (that the second element is 'IN' or starts with 'NN') goes from false to true (or vice versa). If we display groups you can see how it is split:

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 

print([(a,list(b)) for a,b in groups])

[(False, [('The', 'RB')]),
 (True,
  [('company', 'NN'),
   ('whose', 'NNS'),
   ('stock', 'IN'),
   ('has', 'NNP'),
   ('been', 'NNS'),
   ('on', 'NNP')]),
 (False, [('tear', 'VBJ'), ('this', 'VB')]),
 (True,
  [('week', 'NNS'),
   ('already', 'NN'),
   ('sells', 'IN'),
   ('its', 'NNP'),
   ('graphics', 'NNS')]),
 (False, [('processing', 'VB'), ('units', 'VBJ')]),
 (True,
  [('biggest', 'NNS'),
   ('cloud', 'NN'),
   ('companies', 'IN'),
   ('just', 'NNP'),
   ('that', 'IN')])]

The boolean value says whether the following list contains elements that satisfy or don't satisfy the condition. Now all you have to do is only keep the one's who's boolean values are true (the condition is satisfied) and then keep the sublists that contain 'IN' as one of the part of speech tags.

And just for fun, if you want the whole solution as a (almost unreadably long) one-liner, you can use:

[[w[0] for w in b] for b in [list(b) for a,b in itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN'])  if a] if 'IN' in [w[1] for w in b]]

EDIT

In order to only keep the sublists that contain an 'IN' word with at least one 'NN' word on either side you can do the following:

Start with the same initial groups and results variables as before:

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 
result = [list(b) for a,b in groups if a]

Apply the same groupby function to the sublists but this time set the condition to having a part of speech equal to 'IN':

result = [[(a,list(b)) for a,b in itertools.groupby(r, lambda x: x[1] == 'IN')] for r in result]

Now iterate through result and remove any elements in the sublists who's boolean value from the groupby is true (POS is 'IN') and it is at the right or left edge of the sublist (index is 0 or -1)

result = [[b for i,(a,b) in enumerate(r) if (a and i not in [0,len(r)-1]) or not a] for r in result]

Now that we have removed these we can join everything together and take out the POS tags to get the correct output format (for details on the list flattening syntax see here)

result = [[w[0] for sub in r for w in sub] for r in result]

print(result)

[['company', 'whose', 'stock', 'has', 'been', 'on'],
 ['week', 'already', 'sells', 'its', 'graphics'],
 ['biggest', 'cloud', 'companies', 'just']]
Community
  • 1
  • 1
bunji
  • 5,063
  • 1
  • 17
  • 36
  • Thanks @bunji for the code. But the following example gives the output as: [('Closure', 'NN'), ('for', 'IN'), ('the', 'DT'), ('same', 'JJ'), ('issue', 'NN'), ('in', 'IN'), ('INC0300660', 'NNP')] output: [['Closure', 'for'], ['issue', 'in', 'INC0300660']] The condition should be 'IN' surrounded by atleast one 'NN' on both sides. Thanks – sharathchandramandadi May 12 '17 at 14:05
  • I didn't know about `groupby` that is very useful. – Aleksandar Savkov May 12 '17 at 14:25
  • @sharathreddymandadi my mistake, I didn't realize that you needed at least one `'NN'` on either side. I will edit my answer to cover this. – bunji May 12 '17 at 14:30
1

three line version(inner func):

def solve(bb):
    def _solve(lst):
        return False if not len(lst) else _solve(lst[1:]) if "NN" in lst[0][1] else "IN" in lst[0][1]
    return [bb[i][0] for i in range(len(bb)) if "NN" in bb[i][1] and (_solve(bb[0:i][::-1]) or _solve(bb[i:-1]))]

two line version(lambda):

def solve(bb):
    s = lambda lst: False if not len(lst) else s(lst[1:]) if "NN" in lst[0][1] else "IN" in lst[0][1]
    return [bb[i][0] for i in range(len(bb)) if "NN" in bb[i][1] and (s(bb[0:i][::-1]) or s(bb[i:-1]))]
gushitong
  • 1,898
  • 16
  • 24
  • Thanks for the code @fushitong. But i'm getting IndexError: list index out of range, when tried with other examples like below. [('Closure', 'NN'), ('for', 'IN'), ('the', 'DT'), ('same', 'JJ'), ('issue', 'NN'), ('in', 'IN'), ('INC0300660', 'NNP')] I want to only identify 'IN' tagged strings surrounded by at least one 'NN' on both sides. – sharathchandramandadi May 12 '17 at 13:55
1

I'm not sure how this fits your requirements as it may yield the same nouns as part of two different hits, for example [N*, N*, IN, N*, N*, IN, N*] --> [[N*, N*, IN, N*, N*], [N*, N*, IN, N*]]. If this is undesired then you will have a different approach. Here you simply keep a back buffer and check for all words if they cover the minimal requirements (N*, IN, N*). If they do then simply construct the full hit. I also use a generator as this is likely to be run on a large amount of data.

def solvr(bb):

    # keep a buffer of the previous tags
    back_buffer = []

    for i in range(len(bb)-1):

        word, tag = bb[i]
        _, next_tag = bb[i+1]

        # make sure there is a minimal hit of 3 tokens
        if tag == 'IN' and next_tag.startswith('N') and len(back_buffer) > 0:
            hit = back_buffer + [word]
            for it in bb[i+1:]:
                if it[1].startswith('N'):
                    hit.append(it[0])
                else:
                    break
            yield hit

        # add to the buffer
        if tag.startswith('N'):
            back_buffer.append(word)

        # reset the buffer as the sequence of N* tags has ended
        else:
            back_buffer = []
print(list(solvr(bb)))
Aleksandar Savkov
  • 2,894
  • 3
  • 24
  • 30
1

Try this "magic":

>>> bb = [('The', 'RB'), ('company', 'NN'), ('whose', 'NNS'), ('stock', 'IN'), ('has', 'NNP'), ('been', 'NNS'), ('on', 'NNP'), ('tear', 'VBJ'), ('this', 'VB'), ('week', 'NNS'), ('already', 'NN'), ('sells', 'IN'), ('its', 'NNP'), ('graphics', 'NNS'), ('processing', 'VB'), ('units', 'VBJ'), ('biggest', 'NNS'), ('cloud', 'NN'), ('companies', 'IN'), ('just', 'NNP'), ('that', 'IN')]

>>> filter(None, map(str.strip, ' '.join([word if pos.startswith('NN') or pos == 'IN'else '|' for word, pos in bb]).split('|')))
['company whose stock has been on', 'week already sells its graphics', 'biggest cloud companies just that']

Essentially, without crazy nested madness:

tmp = []
answer = []
for word, pos in bb:
    if pos.startswith('NN') or pos == 'IN':
        tmp.append(word)
    else:
        if tmp:
            answer.append(' '.join(tmp))
            tmp = []

if tmp: # Remeber to flush out the last tmp.
    answer.append(' '.join(tmp))

You only iterate through the bb once. This is similar to @bunji's answer with itertools.groupby

alvas
  • 115,346
  • 109
  • 446
  • 738