Find beginning of sentence in String

Question

I want to display the results of a search query on a website, each with a title and a short description. The short description should be a small part of the page that contains the search term. What I want to do is:

1. Strip the tags from the page.
2. Find the first position of the search term.
3. From that position, go back and find the beginning (if there is one) of that sentence.
4. Start at the position found in step 3 and display, e.g., 200 characters from there.

I need some help with step 3. I think I need a regex that finds the first capital letter or dot...

score 5 · Answer 1 · answered Oct 10 '08 at 14:18


Even that will ultimately fail. Given the sentence "We went to Dr. Smith's office", if your search term is "office", virtually any criterion you use will give you "Smith's office" as your sentence.


James Curran


I posted a slight change to the strategy... can you see any bug in that one? – Mostlyharmless Oct 10 '08 at 14:31

Mostlyharmless · Answer 2 · 2008-10-10T15:25:46.493

The way I would do it is, I would parse the page...

Skip over all the things starting with '<'
When you encounter a "." or [A-Z], start putting it into a buffer till you find another "."
If the buffered string has the search keyword, that's your string! Otherwise, start buffering again at the "." you encountered and repeat.
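A minimal sketch of the buffering steps above, written in Python for illustration. The function name `first_sentence_with` and the regex used to skip tags are my own assumptions, and the tag handling is deliberately crude (it is not a real HTML parser).

```python
import re

def first_sentence_with(html, keyword):
    """Return the first '.'-terminated chunk containing keyword, or None."""
    # Skip everything that looks like a tag, then collapse whitespace.
    text = " ".join(re.sub(r"<[^>]*>", " ", html).split())
    needle = keyword.lower()
    buffer = []
    for ch in text:
        buffer.append(ch)
        if ch == ".":
            sentence = "".join(buffer).strip()
            if needle in sentence.lower():
                return sentence
            buffer = []  # discard and start buffering after this "."
    tail = "".join(buffer).strip()  # text after the last "."
    return tail if tail and needle in tail.lower() else None
```

For "We went to Dr. Smith's office." and the keyword "office", this returns "Smith's office.", which is exactly the failure James Curran describes.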

EDIT: As James Curran pointed out, this strategy would fail in some cases... So here's the solution:

What you can do is take the first X characters of the page (after stripping tags), then search for your keyword while buffering the two previous words. When you find it, output something like this: {X} ... {prev-2} {keyword} {next-2}

Example: This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn't the small green pieces of paper that were unhappy.

Search Keyword: "suggested"

Result: This planet has - or rather had - a problem ... Many solutions were suggested for this problem...
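A rough Python sketch of that "lead plus keyword context" idea. The name `keyword_snippet` and the parameters `lead_chars` and `context_words` are hypothetical, and the defaults are arbitrary, so the output will not match the example above word for word.

```python
def keyword_snippet(text, keyword, lead_chars=40, context_words=2):
    """Return '<lead> ... <words around keyword>...' or None if keyword is absent."""
    words = text.split()
    needle = keyword.lower()
    # Index of the first word that contains the keyword (case-insensitive).
    hit = next((i for i, w in enumerate(words) if needle in w.lower()), None)
    if hit is None:
        return None
    # Lead: the first lead_chars characters, minus the last (possibly partial) word.
    if len(text) > lead_chars:
        parts = text[:lead_chars].rsplit(None, 1)
        lead = parts[0] if parts else ""
    else:
        lead = text
    # Context: context_words words on each side of the hit, clamped to the text.
    start = max(0, hit - context_words)
    end = min(len(words), hit + context_words + 1)
    context = " ".join(words[start:end])
    return f"{lead} ... {context}..."
```

No sentence detection is involved, so abbreviations like "Dr." cannot break it; the price is that the snippet may begin or end mid-sentence.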

score 1 · Answer 3 · answered Oct 10 '08 at 14:20


For step 3: take the substring that ends where you want to search backward from, reverse it, get the position of the first '.', and subtract that value from the position of your search string.

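// Reverse the text before the match so the nearest preceding '.' comes first.
$offset = strpos(strrev(substr($string, 0, $searchlocation)), '.');
// If there is no earlier '.', fall back to the start of the string.
$startloc = ($offset === false) ? 0 : $searchlocation - $offset;
$finalstring = substr($string, $startloc, 200);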

That may be off by 1, but I think it'll get the job done. Seems like there should be a shorter way to do it.
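On the "shorter way": searching backward directly avoids the reversal. Here is a hedged Python sketch using `str.rfind`; the function name `snippet_from_sentence_start` is made up for illustration, and it still treats every '.' as a sentence end.

```python
def snippet_from_sentence_start(text, term, length=200):
    """Return up to `length` characters starting after the last '.' before `term`."""
    # Step 2: first occurrence, ignoring case.
    # Assumes lowercasing does not change the string length (true for ASCII).
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    # Step 3: rfind returns -1 when there is no earlier '.',
    # so adding 1 yields 0 (the start of the text) in that case.
    start = text.rfind(".", 0, pos) + 1
    # Step 4: cut the snippet and drop the whitespace that followed the '.'.
    return text[start:start + length].lstrip()
```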


acrosman


James Curran's answer also applies here: this would still fail for "Dr. Smith's office". – acrosman Oct 10 '08 at 14:22

score 1 · Answer 4 · answered Oct 10 '08 at 14:53

Instead of trying to find sentences, I'd think about how much context around the search term I need, measured in words. Then go backwards some fraction of this number of words (or to the beginning) and forward the remaining number of words to select the rest of the context. That way, you just split the entire corpus on whitespace, find the first occurrence of the term (perhaps using a fuzzy match to find subterms and account for punctuation), and apply the above algorithm. You could even be creative about introducing ellipses if the first non-selected term doesn't end in punctuation, etc.
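One possible reading of this, sketched in Python. The names `word_window`, `total_words`, and `before_fraction` are assumptions of mine, the "fuzzy match" is reduced to a case-insensitive substring test after stripping punctuation, and the ellipsis rule is only one interpretation of the last sentence.

```python
import string

def word_window(text, term, total_words=30, before_fraction=0.3):
    """Return about total_words words of context around the first hit, or None."""
    words = text.split()
    needle = term.lower()
    # Crude "fuzzy" match: ignore surrounding punctuation and letter case.
    hit = next((i for i, w in enumerate(words)
                if needle in w.strip(string.punctuation).lower()), None)
    if hit is None:
        return None
    # Go back a fraction of the budget (or to the beginning), then forward.
    start = max(0, hit - int(total_words * before_fraction))
    end = min(len(words), start + total_words)
    snippet = " ".join(words[start:end])
    # Add ellipses where the cut does not fall on sentence-ending punctuation.
    if start > 0 and words[start - 1][-1] not in ".!?":
        snippet = "... " + snippet
    if end < len(words) and words[end - 1][-1] not in ".!?":
        snippet += " ..."
    return snippet
```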

Iterniam · Answer 5 · 2022-03-24T18:08:15.560

To save others from thinking they can beat this problem: it can't be done without accepting either false positives or false negatives. To add to what James Curran said, you either declare "Smith" the start of a sentence in "We went to Dr. Smith's office.", or you read "This sentence is English. So is this one." as a single sentence. On top of those problems, different forms of abbreviations and Overeager Capitalization Of Every Word Can Kill Your Algorithm Or Regex.

That said, I might as well share the regexes I came up with.

The first regex is simple enough:

(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)

It matches the start of a line or one of the characters . ! ? followed by at least one tab or space. After that, a capital letter is matched along with the rest of the word (including dots, so abbreviations are matched). The first word of the sentence is captured in group 1.
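To tie this back to step 3 of the question, here is a small Python sketch that uses the first regex to find the sentence start preceding a given position. The helper name `sentence_start_before` is my own, and I am assuming Python's `re` module, which accepts this pattern as written.

```python
import re

# The first regex from this answer, unchanged.
SENTENCE_START = re.compile(r"(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)")

def sentence_start_before(text, pos):
    """Offset of the last detected sentence start at or before pos, else 0."""
    start = 0
    for m in SENTENCE_START.finditer(text):
        if m.start(1) > pos:
            break  # this sentence begins after the search term
        start = m.start(1)
    return start

text = "This sentence is English. So is this one."
pos = text.find("this")  # the lowercase "this" in the second sentence
print(text[sentence_start_before(text, pos):])
```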

The second regex is:

(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)

This is the previous regex, prepended with [A-Z]\S*\.[^\S\r\n]+[A-Z]|. This part matches a word starting with a capital, followed by a dot, some whitespace and another capitalized character. Because the first part gets matched, the second part no longer tries to match it (explained in-depth here). The first word of the sentence will again be caught in group 1.

The first regex has false positives: it will wrongly match "Smith's" in the second half of the sentence "We went to Dr. Smith's office."
The second regex has false negatives: it will fail to match "So" in "This sentence is English. So is this one."
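Both behaviours can be checked with a few lines of Python. This is only a sketch: the helper `first_words` is hypothetical and simply collects group 1 from every match, skipping matches of the "abbreviation" alternative, which leave group 1 empty.

```python
import re

FIRST = re.compile(r"(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)")
SECOND = re.compile(r"(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)")

def first_words(pattern, text):
    """Words the pattern reports as sentence starts (group 1 only)."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

doctor = "We went to Dr. Smith's office."
english = "This sentence is English. So is this one."

print(first_words(FIRST, doctor))    # ['We', "Smith's"]  (false positive)
print(first_words(SECOND, doctor))   # ['We']
print(first_words(FIRST, english))   # ['This', 'So']
print(first_words(SECOND, english))  # ['This']           (false negative)
```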

Test the regexes here.
