This is related to the following question: Searching for Unicode characters in Python

I have a string like this:

sentence = 'AASFG BBBSDC FEKGG SDFGF'

I split it and get a list of words like this:

words = sentence.split()  # ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']

I search for part of a word using the following code and get the whole word:

[word for word in sentence.split() if word.endswith("GG")]

It returns ['FEKGG'].

Now I need to find out what is in front of and behind that word.

For example, when I search for "GG" it returns ['FEKGG']. It should also be able to get:

behind = 'BBBSDC'
infront = 'SDFGF'
ChamingaD

5 Answers

Suppose you have the following string (edited from the original) and use this generator:

sentence = 'AASFG BBBSDC FEKGG SDFGF KETGG'

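def neighborhood(iterable):
    # Yield (previous, current, next) triples; a missing neighbour is None.
    iterator = iter(iterable)
    prev = None
    try:
        item = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for nxt in iterator:  # 'nxt' avoids shadowing the built-in next()
        yield (prev, item, nxt)
        prev = item
        item = nxt
    yield (prev, item, None)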

matches = [word for word in sentence.split() if word.endswith("GG")]
results = []

for prev, item, nxt in neighborhood(sentence.split()):
    if item in matches:
        results.append((prev, item, nxt))

results then contains:

[('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]
jrd1

Here's one possibility:

words = sentence.split()
[pos] = [i for (i, word) in enumerate(words) if word.endswith("GG") ]
behind = words[pos - 1]
infront = words[pos + 1]

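You might need to take care with edge cases, such as "…GG" not appearing, appearing more than once, or being the first or last word. As it stands, a missing or repeated match raises ValueError (from the [pos] unpacking) and a match on the last word raises IndexError, which may well be the correct behaviour. A match on the first word, however, raises nothing on its own: words[pos - 1] becomes words[-1] and silently returns the last word.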
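The edge cases mentioned above can also be handled explicitly instead of relying on exceptions. A minimal sketch, assuming Python 3; the helper name find_neighbours is hypothetical and not part of the original answer. It returns one triple per match and uses None for a missing neighbour:

```python
def find_neighbours(sentence, suffix):
    """Return (behind, match, infront) for every word ending with suffix."""
    words = sentence.split()
    results = []
    for i, word in enumerate(words):
        if word.endswith(suffix):
            # Guard both ends explicitly; words[i - 1] would wrap around at i == 0.
            behind = words[i - 1] if i > 0 else None
            infront = words[i + 1] if i + 1 < len(words) else None
            results.append((behind, word, infront))
    return results


print(find_neighbours('AASFG BBBSDC FEKGG SDFGF', 'GG'))
# [('BBBSDC', 'FEKGG', 'SDFGF')]
```

An empty result list then signals "no match", and more than one entry signals repeated matches, so the caller can decide how to treat each case.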

A completely different solution using regexes avoids splitting the string into a list in the first place:

import re

m = re.search(r'\b(\w+)\s+(?:\w+GG)\s+(\w+)\b', sentence)
(behind, infront) = m.groups()
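Note that re.search returns None when the pattern does not match, which here includes the case where the "GG" word is the first or last word, so the .groups() call is worth guarding. A hedged sketch, assuming Python 3; the helper name neighbours_by_regex is hypothetical:

```python
import re


def neighbours_by_regex(sentence, suffix="GG"):
    """Return (behind, infront) for the first word ending with suffix, or None."""
    # re.escape keeps the suffix literal even if it contains regex metacharacters.
    pattern = r'\b(\w+)\s+\w+' + re.escape(suffix) + r'\s+(\w+)\b'
    m = re.search(pattern, sentence)
    if m is None:
        # No match, or the matching word has no neighbour on one side.
        return None
    return m.groups()


print(neighbours_by_regex('AASFG BBBSDC FEKGG SDFGF'))
# ('BBBSDC', 'SDFGF')
```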
Marcelo Cantos

This is one way. The infront and behind elements will be None if the "GG" word is at the beginning or end of the sentence.

words = sentence.split()
[(behind, word, infront) for (behind, word, infront) in
 zip([None] + words[:-1], words, words[1:] + [None])
 if word.endswith("GG")]
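To see what the padded zip produces, here is a small runnable sketch (Python 3 assumed) using the question's sentence with one extra word appended. The neutral names previous and following are introduced here for illustration only:

```python
sentence = 'AASFG BBBSDC FEKGG SDFGF KETGG'
words = sentence.split()

# Pad with None on each side so every word has a left and a right neighbour.
triples = list(zip([None] + words[:-1], words, words[1:] + [None]))
matches = [(previous, word, following)
           for (previous, word, following) in triples
           if word.endswith("GG")]

print(matches)
# [('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]
```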
Whatang
sentence = 'AASFG BBBSDC FEKGG SDFGF AAABGG FOOO EEEGG'

def make_trigrams(l):
    l = [None] + l + [None]

    for i in range(len(l)-2):
        yield (l[i], l[i+1], l[i+2])


for result in [t for t in make_trigrams(sentence.split()) if t[1].endswith('GG')]:
    behind, match, infront = result

    print('Behind:', behind)
    print('Match:', match)
    print('Infront:', infront, '\n')

Output:

Behind: BBBSDC
Match: FEKGG
Infront: SDFGF

Behind: SDFGF
Match: AAABGG
Infront: FOOO

Behind: FOOO
Match: EEEGG
Infront: None
DevLounge

Another itertools-based option, which may be more memory-friendly on large datasets:

from itertools import tee

def sentence_targets(sentence, endstring):
    before, target, after = tee(sentence.split(), 3)
    # offset the iterators so each triple is (previous, current, next);
    # the None default avoids StopIteration on very short sentences
    next(target, None)
    next(after, None)
    next(after, None)
    for trigram in zip(before, target, after):
        if trigram[1].endswith(endstring):
            yield trigram
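As written, the three offset iterators only ever place the second through second-to-last words in the target position, so a match on the first or last word is not reported. One possible way to cover the boundaries is to pad the word stream with None on both sides. A sketch, assuming Python 3; the name sentence_targets_padded is hypothetical:

```python
from itertools import chain, islice, tee


def sentence_targets_padded(sentence, endstring):
    """Yield (before, target, after) triples, with None at the sentence edges."""
    padded = chain([None], sentence.split(), [None])
    before, target, after = tee(padded, 3)
    # islice(it, n, None) lazily skips the first n items of an iterator.
    triples = zip(before, islice(target, 1, None), islice(after, 2, None))
    for trigram in triples:
        # The middle element is always a real word, never the None padding.
        if trigram[1].endswith(endstring):
            yield trigram


print(list(sentence_targets_padded('FEKGG SDFGF KETGG', 'GG')))
# [(None, 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]
```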

theodox