Extract X number of words surrounding a given search string within a string

Question

I am looking for a way to extract X number of words on either side of a given word in a search.

For example, if a user enters "inmate" as a search word and the MySQL query finds a post that contains "inmate" in its content, I would like to return not the entire post but just X words on either side of the match. That gives the user the gist of the post, so they can decide whether to continue on and read it in full.

I am using PHP.

Thanks!

This might also help you: http://stackoverflow.com/q/1436582/1066234 — Avatar, May 24 '15 at 20:52

morja · Accepted Answer · 2011-11-24T09:24:15.127

10

You might not be able to fully solve this problem with regex. There are too many possibilities of other characters between the words...

But you can try this regex:

((?:\S+\s*){0,5}\S*inmate\S*(?:\s*\S+){0,5})

See it here: Rubular

You might also want to exclude certain characters, since they should not count as part of a word. Right now the regex treats any run of non-space characters surrounded by spaces as a word.

To match only real words:

((?:\w+\s*){0,5}<search word>(?:\s*\w+){0,5})

But here any non-word character (, " . etc.) breaks the matching.

So you can go on...

((?:[\w"',.-]+\s*){0,5}["',.-]?<search word>["',.-]?(?:\s*[\w"',.-]+){0,5})

This matches up to 5 words on each side, where a word may contain any of "',.- and the search term may have one of those characters directly before or after it.
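To see how the three patterns above differ, the sketch below runs each of them on the same sentence, with the window reduced to 2 words so the differences are easy to spot. The sample sentence and the array labels are made up for illustration.

```php
<?php
// Sample input (an assumption for this demo): the search word is quoted.
$text = 'if a user enters "inmate" as a search word';

// The three patterns from the answer, with {0,5} shortened to {0,2}.
$patterns = [
    'non-space runs'   => '/(?:\S+\s*){0,2}\S*inmate\S*(?:\s*\S+){0,2}/',
    'word chars only'  => '/(?:\w+\s*){0,2}inmate(?:\s*\w+){0,2}/',
    'words plus punct' => '/(?:[\w"\',.-]+\s*){0,2}["\',.-]?inmate["\',.-]?(?:\s*[\w"\',.-]+){0,2}/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() returns 1 on a match; $m[0] is the full matched text.
    $results[$label] = preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
    echo $label, ': ', var_export($results[$label], true), "\n";
}
```

On this input the \w-only pattern returns just inmate, because the quotes next to the search word cannot be consumed by \w or \s, while the other two patterns return user enters "inmate" as a.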

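To use it in PHP:

$sourcestring = "For example, if a user enters \"inmate\" as a search word and the MySQL";
// Capture up to 5 whitespace-delimited words on either side of "inmate".
preg_match_all('/(?:\S+\s*){0,5}\S*inmate\S*(?:\s*\S+){0,5}/s', $sourcestring, $matches);
// $matches[0] holds every full match; guard against the no-match case.
if (!empty($matches[0])) {
    echo $matches[0][0]; // you might have more matches, they will be in $matches[0][x]
}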
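Since the search word comes from the user, it is probably worth escaping it before building the pattern. The following is a minimal sketch of that idea, not part of the original answer: the function name excerpt_around and its parameters are hypothetical, and it uses \s+ instead of \s* between words (a deliberate change, so that a "word" cannot be split into pieces during backtracking).

```php
<?php
// Hypothetical helper: returns the first occurrence of $term together with
// up to $words whitespace-delimited words on each side, or null if absent.
function excerpt_around(string $text, string $term, int $words = 5): ?string
{
    // preg_quote() escapes regex metacharacters in the user-supplied term;
    // the second argument also escapes the '/' delimiter.
    $quoted = preg_quote($term, '/');
    $n = max(0, $words);

    // Same shape as the answer's first regex, case-insensitive via /i.
    $pattern = '/(?:\S+\s+){0,' . $n . '}\S*' . $quoted . '\S*(?:\s+\S+){0,' . $n . '}/i';

    if (preg_match($pattern, $text, $m) === 1) {
        return $m[0];
    }
    return null; // term not found (or a regex error occurred)
}

$post = 'For example, if a user enters "inmate" as a search word and the MySQL';
echo excerpt_around($post, 'inmate', 3), "\n";
```

With a window of 3, this prints a user enters "inmate" as a search. In a real search page, $post would be the content column fetched from MySQL for each matching row.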

edited Nov 24 '11 at 09:24

answered Nov 24 '11 at 00:59


To add to morja's answer, you could select the string from MySQL with PHP's preg_match: http://php.net/manual/en/function.preg-match.php. – bozdoz Nov 24 '11 at 01:22
Thanks, I will try these out when I get a chance later today. I appreciate the time ya'll have taken to answer this! – programmer guy Nov 24 '11 at 01:27
I have tried it, it works sometimes on Rubular... Hmmm... I have tried to implement it in PHP and I can't seem to wrap my head around it... Could anyone point me in the right direction? – programmer guy Nov 24 '11 at 04:18

score 3 · Answer 2 · answered Feb 21 '12 at 13:50

I would use this regex for PHP, which also takes UTF-8 characters into account:

'~(?:[\p{L}\p{N}\']+[^\p{L}\p{N}\']+){0,5}<search word>(?:[^\p{L}\p{N}\']+[\p{L}\p{N}\']+){0,5}~u'

Here '~' is the delimiter, and the 'u' modifier at the end makes PCRE treat the pattern and the subject string as UTF-8.

See the documentation on Unicode regex properties here:

http://www.regular-expressions.info/refunicode.html
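As a rough illustration of the Unicode-aware pattern above, here is a small sketch that plugs a search word into it. The function name unicode_excerpt, the sample French sentence and the window of 2 words are assumptions made for the example, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper built around the \p{L}/\p{N} pattern from this answer.
function unicode_excerpt(string $text, string $term, int $words = 5): ?string
{
    $n = max(0, $words);
    $word = '[\p{L}\p{N}\']+';   // letters, digits and apostrophes form a word
    $gap  = '[^\p{L}\p{N}\']+';  // anything else separates words

    $pattern = '~(?:' . $word . $gap . '){0,' . $n . '}'
             . preg_quote($term, '~')   // escape the term and the '~' delimiter
             . '(?:' . $gap . $word . '){0,' . $n . '}~u';

    // preg_match() returns false on error, e.g. invalid UTF-8 input with /u.
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$text = 'Le rapport dit que le détenu a été transféré hier soir.';
echo unicode_excerpt($text, 'détenu', 2), "\n";
```

This prints que le détenu a été. Without the 'u' modifier, the accented letters would be handled as separate bytes rather than as single characters, so \p{L} would not reliably match them.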
