I'm writing a function that finds the text surrounding each occurrence of a keyword in a larger piece of text. So far so good, just not pretty.

I'm having trouble trimming the resulting string to the nearest sentence/whole word, without leaving any characters hanging over. The trim distance is based on a number of words either side of the keyword.

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

With a 1-word distance (either side of the keyword) it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"

With a 2-word distance:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"

What I've got so far is based on character distance, not word distance:

2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"

However a regex could split it to the nearest whole word or sentence. Is that the most Pythonic way to achieve this? This is what I've got so far:

import re

def trim_string(s, num):
  # interpolate num into the pattern; Python backreferences are \1, not $1
  trimmed = re.sub(r"^(.{%d}\S*).*" % num, r"\1", s) # will only trim from left and not very well
  #^(.*)(marble)(.+) # only finds second occurrence???

  return trimmed

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"


if t.lower() in s.lower():

  count = s.lower().count(t.lower())
  print ("%s occurrences of %s" %(count, t))

  original_s = s

  for i in range (0, count):
    idx = s.lower().index(t.lower())  # case-insensitive search, consistent with the count above
    # print idx

    dist = 10
    start = max(0, idx - dist)  # a negative start would wrap around to the end of the string
    end = len(t) + idx+dist
    a = s[start:end]

    print(a)
    print(trim_string(a, 5))

    s = s[idx+len(t):]

Thank you.

Ghoul Fool
  • How do you want to handle whitespace? If you just consider single spaces between "words" you could use `.split()` on the input text then use list indexes to manipulate sub-set and re-join the words into a single string. It gets you out of using regex if that's a benefit to you. – Matt R. Wilson Aug 11 '17 at 20:15
  • I don't want any leading or trailing whitespace in the results, if that's what you mean. The ellipsis (...) is there to illustrate that the string has been broken at that point. – Ghoul Fool Aug 11 '17 at 20:20

4 Answers

You can use this regex to match up to N non-whitespace substrings on either side of marble:

2 words:

(?:(?:\S+\s+){0,2})?\bmarble\b\S*(?:\s+\S+){0,2}

RegEx Breakup:

(?:(?:\S+\s+){0,2})? # match up to 2 non-whitespace strings before the keyword (greedy, optional)
\bmarble\b\S*        # match the word "marble" followed by zero or more non-space characters
(?:\s+\S+){0,2}      # match up to 2 non-whitespace strings after the keyword

1 word regex:

(?:(?:\S+\s+){0,1})?\bmarble\b\S*(?:\s+\S+){0,1}
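
To use this from Python, the pattern can be built for any word distance and applied with re.finditer. The following is only a sketch: the function name keyword_context is made up for illustration, and the use of re.escape and re.IGNORECASE is an assumption about what the question wants (a keyword such as "Marble" should match "marble").

```python
import re

def keyword_context(text, keyword, n):
    """Return each keyword occurrence with up to n words on either side."""
    # Same pattern as above, with the word count and keyword filled in.
    pattern = re.compile(
        r"(?:\S+\s+){0,%d}\b%s\b\S*(?:\s+\S+){0,%d}" % (n, re.escape(keyword), n),
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for n in (1, 2):
    matches = keyword_context(s, "Marble", n)
    print("%d occurrences found" % len(matches))
    for m in matches:
        print(m)
```

Because regex matches do not overlap, two occurrences that sit closer together than the word distance will not both get their full context.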
anubhava

You can do this without re if you ignore the punctuation:

import itertools as it
import string

def nwise(iterable, n):
    ts = it.tee(iterable, n)
    for c, t in enumerate(ts):
        next(it.islice(t, c, c), None)
    return zip(*ts)

def grep(s, k, n):
    m = str.maketrans('', '', string.punctuation)
    return [' '.join(x) for x in nwise(s.split(), n*2+1) if x[n].translate(m).lower() == k]

In []:
keyword = "marble"
sentence = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print('...\n...'.join(grep(sentence, keyword, n=2)))

Out[]:
Right. This marble is as...
...as this marble. Kwoo-oooo-waaa! Ahhhk!

In []:
print('...\n...'.join(grep(sentence, keyword, n=1)))

Out[]:
This marble is...
...this marble. Kwoo-oooo-waaa!
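
One limitation of fixed-size windows is that a keyword closer than n words to the start or end of the text never lands in the middle of a window, so it is skipped. A possible variant using plain list slicing, which clips the window at the edges instead, might look like this. It is a sketch only; grep_slices is a hypothetical name, and stripping punctuation with str.strip is an assumption about how words should be compared.

```python
import string

def grep_slices(text, keyword, n):
    """Return up to n words either side of each keyword occurrence."""
    words = text.split()
    target = keyword.lower()
    results = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation, ignoring case.
        if word.strip(string.punctuation).lower() == target:
            # max(0, ...) clips the window at the start; slicing clips the end.
            results.append(" ".join(words[max(0, i - n): i + n + 1]))
    return results

sentence = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print("...\n...".join(grep_slices(sentence, "marble", n=3)))
```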
AChampion

Using the get_ngrams() function below (taken from another answer), here's one approach which takes all the n-grams and then keeps the ones with the keyword in the middle:

def get_ngrams(document, n):
    words = document.split(' ')
    ngrams = []
    for i in range(len(words)-n+1):
        ngrams.append(words[i:i+n])
    return ngrams

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

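n = 3  # window size: 1 word either side of the keyword (use 5 for 2 words)
pos = n // 2  # index of the middle word in each n-gram
# ignore punctuation by matching the middle word up to the number of chars in keyword
result = [ng for ng in get_ngrams(string, n) if ng[pos][:len(keyword)] == keyword]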
andrew_reece

more_itertools.adjacent [1] is a tool that probes neighboring elements.

import operator as op
import itertools as it

import more_itertools as mit


# Given
keyword = "marble"
iterable = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

Code

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['This marble is', 'this marble. Kwoo-oooo-waaa!']

neighbors = mit.adjacent(pred, words, distance=2)
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['Right. This marble is as', 'as this marble. Kwoo-oooo-waaa! Ahhhk!']

The OP may adjust the final output of these results as desired.


Details

The given string has been split into an iterable of words. A simple predicate [2] was defined that returns True if a word equals the keyword (or the keyword with a trailing period).

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)
list(neighbors)

An iterable of (bool, word) tuples is returned by the more_itertools.adjacent tool:

Output

[(False, 'Right.'),
 (True, 'This'),
 (True, 'marble'),
 (True, 'is'),
 (False, 'as'),
 (False, 'slippery'),
 (False, 'as'),
 (True, 'this'),
 (True, 'marble.'),
 (True, 'Kwoo-oooo-waaa!'),
 (False, 'Ahhhk!')]

The first element of each tuple is True for any occurrence of the keyword and for neighboring words within a distance of 1. We use this boolean and itertools.groupby to find and group together consecutive, neighboring items. For example:

neighbors = mit.adjacent(pred, words, distance=1)
[(k, list(g)) for k, g in it.groupby(neighbors, op.itemgetter(0))]

Output

[(False, [(False, 'Right.')]),
 (True, [(True, 'This'), (True, 'marble'), (True, 'is')]),
 (False, [(False, 'as'), (False, 'slippery'), (False, 'as')]),
 (True, [(True, 'this'), (True, 'marble.'), (True, 'Kwoo-oooo-waaa!')]),
 (False, [(False, 'Ahhhk!')])]

Finally, we apply a condition to filter the False groups and join the strings together.

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]

Output

['This marble is', 'this marble. Kwoo-oooo-waaa!']
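
For keywords followed by other punctuation or written in a different case, a stronger predicate is needed. The sketch below shows one possibility. To keep it runnable without third-party packages it uses a small standard-library stand-in for more_itertools.adjacent; the names make_pred, adjacent_flags and contexts are hypothetical and not part of any library.

```python
import itertools as it
import string

def make_pred(keyword):
    """Return a predicate that ignores case and surrounding punctuation."""
    target = keyword.lower()
    return lambda word: word.strip(string.punctuation).lower() == target

def adjacent_flags(pred, items, distance=1):
    """Pair each item with True if it, or an item within `distance`, satisfies pred."""
    items = list(items)
    hits = [i for i, item in enumerate(items) if pred(item)]
    return [(any(abs(i - h) <= distance for h in hits), item)
            for i, item in enumerate(items)]

def contexts(text, keyword, distance=1):
    """Join each run of flagged words into one context string."""
    flagged = adjacent_flags(make_pred(keyword), text.split(), distance)
    return [" ".join(word for _, word in group)
            for is_near, group in it.groupby(flagged, key=lambda pair: pair[0])
            if is_near]

text = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print(contexts(text, "Marble", distance=1))
```

As with the groupby approach above, occurrences whose neighborhoods touch or overlap are merged into a single result rather than reported separately.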

[1] more_itertools is a third-party library that implements many useful tools, including the itertools recipes.

[2] Note, stronger predicates can certainly be made for keywords with any punctuation, but this one was used for simplicity.

pylang