Trim string (both left and right) to nearest word or sentence

Question

I'm writing a function that finds the text near each occurrence of a given string in a larger piece of text. So far so good, just not pretty.

I'm having trouble trimming the resulting string to the nearest sentence/whole word, without leaving any characters hanging over. The trim distance is based on a number of words either side of the keyword.

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

With a distance of 1 word (either side of the keyword) it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"

With a distance of 2 words:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"

What I've got so far is based on character distance, not word distance:

2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"

However, a regex could split it to the nearest whole word or sentence. Is that the most Pythonic way to achieve this? This is what I've got so far:

import re

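def trim_string(s, num):
  # the count has to be formatted into the pattern ({num} is not interpolated),
  # and Python backreferences are written \1, not $1
  trimmed = re.sub(r"^(.{%d}[^\s]*).*" % num, r"\1", s) # will only trim from left and not very well
  #^(.*)(marble)(.+) # only finds second occurrence???

  return trimmed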

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"


if t.lower() in s.lower():

  count = s.lower().count(t.lower())
  print ("%s occurrences of %s" %(count, t))

  original_s = s

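  for i in range(count):
    idx = s.lower().index(t.lower())  # search case-insensitively, matching how count was computed
    # print(idx)

    dist = 10
    start = max(0, idx - dist)  # a negative start would slice from the end of the string
    end = len(t) + idx + dist
    a = s[start:end]

    print(a)
    print(trim_string(a, 5))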

    s = s[idx+len(t):]

Thank you.

How do you want to handle whitespace? If you just consider single spaces between "words", you could use `.split()` on the input text, then use list indexes to select a subset and re-join the words into a single string. It gets you out of using regex, if that's a benefit to you. – Matt R. Wilson, Aug 11 '17 at 20:15
I don't want any leading or trailing whitespace in the results, if that's what you mean. The ellipsis (...) is there to illustrate that the string has been broken at that point. – Ghoul Fool, Aug 11 '17 at 20:20
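
A minimal sketch of the split-and-index idea from the comment above, for illustration only. The function name `keyword_snippets` and the ellipsis rule (add one only when the cut falls mid-sentence, judged by whether the neighbouring word ends in `.`, `!` or `?`) are assumptions inferred from the expected output in the question; the question itself does not spell the rule out.

```python
import string

# Assumption: a word ending in one of these closes a sentence.
SENTENCE_END = (".", "!", "?")

def keyword_snippets(text, keyword, distance):
    """Return the words within `distance` words of each keyword occurrence."""
    words = text.split()
    target = keyword.lower()
    snippets = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation, so "marble." still counts.
        if word.strip(string.punctuation).lower() != target:
            continue
        start = max(0, i - distance)
        end = min(len(words), i + distance + 1)
        snippet = " ".join(words[start:end])
        # Leading ellipsis only if text was cut off in the middle of a sentence.
        if start > 0 and not words[start - 1].endswith(SENTENCE_END):
            snippet = "..." + snippet
        # Trailing ellipsis only if the snippet stops before a sentence ends.
        if end < len(words) and not words[end - 1].endswith(SENTENCE_END):
            snippet += "..."
        snippets.append(snippet)
    return snippets

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for snippet in keyword_snippets(s, "Marble", 1):
    print(snippet)
```

Because the text is split on whitespace first, no snippet can start or end in the middle of a word, and joining with single spaces removes any leading or trailing whitespace.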

anubhava · Accepted Answer · 2017-08-11T20:57:15.760

3

You can use this regex to match up to N whitespace-delimited words on either side of marble:

2 words:

(?:(?:\S+\s+){0,2})?\bmarble\b\S*(?:\s+\S+){0,2}

RegEx Breakup:

(?:(?:\S+\s+){0,2})? # optionally match up to 2 whitespace-delimited words before the keyword
\bmarble\b\S*        # match the word "marble" followed by zero or more non-space characters
(?:\s+\S+){0,2}      # match up to 2 whitespace-delimited words after the keyword

1 word regex:

(?:(?:\S+\s+){0,1})?\bmarble\b\S*(?:\s+\S+){0,1}
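
To use this from Python, the pattern can be built for any distance and applied with `re.findall`. This is a sketch under a few assumptions: the helper name `keyword_context` is made up here, `re.escape` is added in case the keyword contains regex metacharacters, and `re.IGNORECASE` is used because the question compares lowercased strings.

```python
import re

def keyword_context(text, keyword, distance):
    """Return each keyword match with up to `distance` words on either side."""
    pattern = r"(?:(?:\S+\s+){0,%d})?\b%s\b\S*(?:\s+\S+){0,%d}" % (
        distance,
        re.escape(keyword),  # treat the keyword as literal text
        distance,
    )
    # The pattern has no capturing groups, so findall returns whole matches.
    return re.findall(pattern, text, flags=re.IGNORECASE)

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for distance in (1, 2):
    print(distance, keyword_context(s, "Marble", distance))
```

Note that `re.findall` returns non-overlapping matches, so when two occurrences sit closer together than the distance, the second one may be absorbed into the first snippet or lose some of its leading words.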

edited Aug 11 '17 at 20:57

answered Aug 11 '17 at 20:11

The regex will also capture words like `marbleilz`; it should be `\W*` after the word, not `\S*`. – Dror Av. Aug 11 '17 at 20:43
This still has a bug if no space comes after the `.` for example - https://regex101.com/r/8HAdYg/3 – Dror Av. Aug 11 '17 at 21:25
Maybe: `(?:(?:\S+\s+){0,2})?\bmarble\b\S?(?:\s*\S+){0,2}` but we don't know whether a missing space between words is a realistic use case or not. Only OP can tell us. – anubhava Aug 11 '17 at 21:28
That's fair enough :) – Dror Av. Aug 11 '17 at 21:31

AChampion · Answer 2 · 2017-08-12T13:38:17.467

You can do this without re if you ignore the punctuation:

import itertools as it
import string

def nwise(iterable, n):
    ts = it.tee(iterable, n)
    for c, t in enumerate(ts):
        next(it.islice(t, c, c), None)
    return zip(*ts)

def grep(s, k, n):
    m = str.maketrans('', '', string.punctuation)
    return [' '.join(x) for x in nwise(s.split(), n*2+1) if x[n].translate(m).lower() == k]

In []:
keyword = "marble"
sentence = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print('...\n...'.join(grep(sentence, keyword, n=2)))

Out[]:
Right. This marble is as...
...as this marble. Kwoo-oooo-waaa! Ahhhk!

In []:
print('...\n...'.join(grep(sentence, keyword, n=1)))

Out[]:
This marble is...
...this marble. Kwoo-oooo-waaa!
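
To see why this works, it may help to look at what `nwise` yields on its own: every run of n consecutive items, as a sliding window. The sketch below repeats the two functions from the answer so it runs by itself; the sample inputs are made up for illustration. It also shows one consequence of using only full windows, as far as I can tell: a keyword closer than n words to either end of the text never sits in the middle of a window, so it is not reported.

```python
import itertools as it
import string

def nwise(iterable, n):
    # Make n independent iterators and advance the c-th one by c items,
    # so zipping them yields consecutive windows of length n.
    ts = it.tee(iterable, n)
    for c, t in enumerate(ts):
        next(it.islice(t, c, c), None)
    return zip(*ts)

def grep(s, k, n):
    m = str.maketrans('', '', string.punctuation)
    return [' '.join(x) for x in nwise(s.split(), n*2+1) if x[n].translate(m).lower() == k]

# Sliding windows of length 3 over five items.
windows = list(nwise("abcde", 3))
print(windows)

# The keyword is the first word, so no window has it in the middle.
edge_case = grep("marble is slippery", "marble", n=1)
print(edge_case)
```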

score 1 · Answer 3 · answered Aug 11 '17 at 22:37

Using an n-gram function from another answer (named get_ngrams() here), here's one approach that takes all the n-grams and then keeps the ones with the keyword in the middle:

def get_ngrams(document, n):
    words = document.split(' ')
    ngrams = []
    for i in range(len(words)-n+1):
        ngrams.append(words[i:i+n])
    return ngrams

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

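n = 3  # window size: 1 word either side of the keyword (use 5 for 2 words)
pos = (n - 1) // 2  # index of the middle word in each n-gram
# ignore trailing punctuation by comparing only the first len(keyword) chars of the middle word
result = [ng for ng in get_ngrams(string, n) if ng[pos][:len(keyword)] == keyword]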
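
The result above is a list of word lists. A small sketch of how it could be turned into strings for a given word distance follows; the wrapper name `keyword_ngrams` and its `distance` parameter are assumptions added for illustration and are not part of the answer.

```python
def get_ngrams(document, n):
    words = document.split(' ')
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def keyword_ngrams(document, keyword, distance):
    """Join every n-gram whose middle word starts with the keyword."""
    n = 2 * distance + 1  # keyword plus `distance` words on each side
    return [
        ' '.join(ng)
        for ng in get_ngrams(document, n)
        # Prefix comparison tolerates trailing punctuation such as "marble."
        if ng[distance][:len(keyword)] == keyword
    ]

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for distance in (1, 2):
    print(keyword_ngrams(string, keyword, distance))
```

The prefix comparison is case-sensitive and also accepts longer words that begin with the keyword, such as "marbles".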

pylang · Answer 4 · 2017-08-24T22:17:15.450

more_itertools.adjacent¹ is a tool that probes neighboring elements.

import operator as op
import itertools as it

import more_itertools as mit


# Given
keyword = "marble"
iterable = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

Code

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['This marble is', 'this marble. Kwoo-oooo-waaa!']

neighbors = mit.adjacent(pred, words, distance=2)
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['Right. This marble is as', 'as this marble. Kwoo-oooo-waaa! Ahhhk!']

The OP may adjust the final output of these results as desired.

Details

The given string has been split into an iterable of words. A simple predicate² was defined, returning True if a word is the keyword (or the keyword with a trailing period).

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)
list(neighbors)

An iterable of (bool, word) tuples is returned from the more_itertools.adjacent tool:

Output

[(False, 'Right.'),
 (True, 'This'),
 (True, 'marble'),
 (True, 'is'),
 (False, 'as'),
 (False, 'slippery'),
 (False, 'as'),
 (True, 'this'),
 (True, 'marble.'),
 (True, 'Kwoo-oooo-waaa!'),
 (False, 'Ahhhk!')]
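
As a rough mental model of the flags shown above, here is a pure-Python sketch that produces the same (bool, word) pairs for this input. The function name `adjacent_flags` is hypothetical, and this is not how more_itertools implements `adjacent`; it only illustrates the idea that an item is flagged when the predicate holds for any item within `distance` positions of it.

```python
def adjacent_flags(pred, items, distance=1):
    """Flag each item that is within `distance` positions of a predicate hit."""
    items = list(items)
    hits = [pred(x) for x in items]
    flagged = []
    for i, item in enumerate(items):
        lo = max(0, i - distance)   # clamp the window at the start
        hi = i + distance + 1       # slicing clamps at the end by itself
        flagged.append((any(hits[lo:hi]), item))
    return flagged

keyword = "marble"
iterable = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
words = iterable.split(" ")
pred = lambda x: x in (keyword, keyword + ".")

flags = adjacent_flags(pred, words, distance=1)
print(flags)
```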

The first element of each tuple is True for any valid occurrence of the keyword and for neighboring words within a distance of 1. We use this boolean and itertools.groupby to find and group together consecutive, neighboring items. For example:

neighbors = mit.adjacent(pred, words, distance=1)
[(k, list(g)) for k, g in it.groupby(neighbors, op.itemgetter(0))]

Output

[(False, [(False, 'Right.')]),
 (True, [(True, 'This'), (True, 'marble'), (True, 'is')]),
 (False, [(False, 'as'), (False, 'slippery'), (False, 'as')]),
 (True, [(True, 'this'), (True, 'marble.'), (True, 'Kwoo-oooo-waaa!')]),
 (False, [(False, 'Ahhhk!')])]

Finally, we apply a condition to filter the False groups and join the strings together.

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]

Output

['This marble is', 'this marble. Kwoo-oooo-waaa!']

¹ more_itertools is a third-party library that implements many useful tools, including the itertools recipes.

² Note that stronger predicates can certainly be made for keywords with any punctuation, but this one was used for simplicity.
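
One possible stronger predicate, sketched here as an assumption about what such a predicate might look like: strip any surrounding punctuation and compare case-insensitively. The factory name `make_keyword_pred` is hypothetical. The resulting function takes a single word and returns a bool, so it should be usable in place of pred in the calls above.

```python
import string

def make_keyword_pred(keyword):
    """Build a predicate that matches the keyword regardless of case
    and of any punctuation attached to either end of the word."""
    target = keyword.lower()

    def pred(word):
        return word.strip(string.punctuation).lower() == target

    return pred

pred = make_keyword_pred("marble")
iterable = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
words = iterable.split(" ")
matches = [w for w in words if pred(w)]
print(matches)
```

Unlike the prefix or trailing-period checks, this rejects longer words such as "marbles" while still accepting forms like "Marble!" or "(marble)".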
