I'm writing a function that finds every occurrence of a keyword in a larger piece of text, together with the text around it. So far so good, just not pretty.
I'm having trouble trimming the resulting string to the nearest sentence or whole word, without leaving any partial words hanging over. The trim distance should be a number of words either side of the keyword. For example:
keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
With a distance of 1 word (either side of the keyword) it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"
With a distance of 2 words:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"
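To pin down the rule behind those expected results, here is a rough sketch of it, not a finished solution. It assumes that words are whatever `str.split()` yields, that the keyword must match a whole word once surrounding punctuation is stripped, and that a sentence ends at `.`, `!` or `?`. The name `keyword_in_context` is made up for illustration.

```python
import string

SENTENCE_END = (".", "!", "?")  # assumption: these characters end a sentence


def keyword_in_context(text, keyword, distance):
    """Return one snippet per occurrence, with `distance` words either side."""
    words = text.split()
    target = keyword.lower()
    snippets = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation, so "marble." still matches.
        if word.strip(string.punctuation).lower() != target:
            continue
        start = max(i - distance, 0)
        end = min(i + distance + 1, len(words))
        snippet = " ".join(words[start:end])
        # Leading ellipsis when the snippet starts mid-sentence.
        if start > 0 and not words[start - 1].endswith(SENTENCE_END):
            snippet = "..." + snippet
        # Trailing ellipsis when the snippet stops mid-sentence.
        if end < len(words) and not words[end - 1].endswith(SENTENCE_END):
            snippet += "..."
        snippets.append(snippet)
    return snippets


text = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for distance in (1, 2):
    hits = keyword_in_context(text, "marble", distance)
    print("%d occurrences found" % len(hits))
    for hit in hits:
        print('"%s"' % hit)
```

Note that this collapses runs of whitespace and would not match "marbles", which may or may not be what is wanted.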
What I've got so far is based on character distance, not word distance:
2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"
However, a regex could presumably trim it to the nearest whole word or sentence. Is that the most Pythonic way to achieve this? This is what I've got so far:
import re

def trim_string(s, num):
    # Keep the first num characters, then run on to the end of the current word.
    # Anchored at the left, so only the tail is cut, and not very well.
    trimmed = re.sub(r"^(.{%d}[^\s]*).*" % num, r"\1", s)
    #^(.*)(marble)(.+) # only finds second occurrence???
    return trimmed

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"
if t.lower() in s.lower():
    count = s.lower().count(t.lower())
    print("%s occurrences of %s" % (count, t))
    original_s = s
    for i in range(0, count):
        # Search case-insensitively, consistent with the count above.
        idx = s.lower().index(t.lower())
        # print(idx)
        dist = 10
        start = max(idx - dist, 0)  # avoid a negative slice index
        end = len(t) + idx + dist
        a = s[start:end]
        print(a)
        print(trim_string(a, 5))
        # Drop everything up to and including this occurrence.
        s = s[idx + len(t):]
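As far as I can tell, the commented-out pattern `^(.*)(marble)(.+)` only finds the last occurrence because the leading `.*` is greedy and swallows the earlier ones. A possible alternative to the count/index/re-slice loop, sketched here rather than tested in anger, is `re.finditer`, which yields every non-overlapping match with its position in the original string. The variable name `positions` is just for illustration.

```python
import re

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"

# re.escape guards against regex metacharacters in the keyword;
# re.IGNORECASE replaces the repeated .lower() calls.
positions = [m.start() for m in re.finditer(re.escape(t), s, re.IGNORECASE)]
print("%d occurrences of %s at %s" % (len(positions), t, positions))
```

Because the offsets refer to the original string, there is no need to keep chopping `s` down after each hit.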
Thank you.