
I am trying to do the following:

Grab the 5 words before the search phrase (or Y words, if only Y are there) and the 5 words after the search phrase (again, or Y if only Y are there) from a block of text. By "words" I mean words, numbers, or whatever else is in the block of text.

For example:

The block of text: "Welcome to Stack Overflow! Visit your user page to set your name and email."

If you were to search for "visit your", it would return: "Welcome to Stack Overflow! Visit your user page to set your"

I've tried using this:

$preg_safe = str_replace(" ", "\s", preg_quote($search)); 
$pattern = "/(\w*\S\s+){0,8}\S*\b($preg_safe)\b\S*(\s\S+){0,8}/ix";
if(preg_match_all($pattern, $full_text, $matches))
{ 
    $result = str_replace(strtolower($search), "<span class='searched-for'>$search</span>", strtolower($matches[0][0])); 
}
else
{ 
    $result = false; 
}

It works if the search phrase is in English, but I need it to work in other languages as well. It doesn't work for a Hebrew search phrase, for example.

I've tried changing the pattern to:

$pattern = "(*UTF8)/(\w*\S\s+){0,8}\S*\b($preg_safe)\b\S*(\s\S+){0,8}/i";

But it didn't work.

How can I make it work for other languages?

////////////////// EDIT //////////

As enrico.bacis suggested, I've changed the pattern to:

$pattern = "/(\w\p{Hebrew}*\S\s+){0,20}\S*\b($preg_safe)\b\S*(\s\S+){0,20}/ixu";

Now it works for English and Hebrew search phrases, but the result text is cut off when there is a special character (' for example).

How can I make the pattern return the text around the search phrase even if it contains special characters?

Shani1351

1 Answer

The problem is that \w does not match Hebrew characters: by default, \w is just a shortcut for a so-called "word" character, [A-Za-z0-9_].

To make the regex match Hebrew characters as well, you only need two changes:

  • Add u to the modifiers so that the pattern and the subject are treated as UTF-8 (your modifiers become /ixu).

  • Replace every occurrence of \w in your pattern with [\w\p{Hebrew}].

You can also check here for more answers on this topic.
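
As a rough sketch, this is how those two changes might look when applied to the pattern from the question. The function name extract_context and the $words parameter are made up for illustration, the x modifier is left out because the pattern contains no free-form whitespace, and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper (not from the question): returns the search phrase
// together with up to $words words of context on each side, or null if
// the phrase is not found.
function extract_context(string $text, string $search, int $words = 5): ?string
{
    // Escape regex metacharacters (including the "/" delimiter), then let
    // any single whitespace character stand in for each space in the phrase.
    $quoted = str_replace(' ', '\s', preg_quote($search, '/'));

    // Change 2: the character class replaces the bare \w of the original.
    $word = '[\w\p{Hebrew}]';

    // Change 1: the "u" modifier treats pattern and subject as UTF-8.
    $pattern = '/(' . $word . '*\S\s+){0,' . $words . '}\S*\b(' . $quoted
             . ')\b\S*(\s\S+){0,' . $words . '}/iu';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$english = 'Welcome to Stack Overflow! Visit your user page to set your name and email.';
echo extract_context($english, 'visit your'), "\n";
// Welcome to Stack Overflow! Visit your user page to set your

$hebrew = 'שלום לכולם ברוכים הבאים לאתר שלנו';
echo extract_context($hebrew, 'ברוכים הבאים'), "\n";
```

To handle further scripts later, the same class could be extended with other properties such as \p{L} (any letter), though that goes beyond what was asked here.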

enrico.bacis
  • I need it to work for Hebrew and English and in the future there will be other languages as well – Shani1351 Oct 25 '12 at 10:42
  • I explained it better, check now – enrico.bacis Oct 25 '12 at 14:43
  • Thank you for your answer. Please see the edit section in the original question – Shani1351 Oct 25 '12 at 16:26
  • You have to decide if it's easier for you to list the "special characters" you want to include or the ones you want to use as delimiters and then include them in your pattern. – enrico.bacis Oct 25 '12 at 19:09
  • Can you give an example on how to use the list of "special characters" I want to include in the pattern? – Shani1351 Oct 28 '12 at 10:06
  • For example, if you want to include '§@#: replace every occurrence of \w in your pattern with [\w\p{Hebrew}'§@#]. If you also want to include a dash, it must be the last (or the first) character in the class, so [\w\p{Hebrew}'§@#-] – enrico.bacis Oct 28 '12 at 10:13
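
To illustrate the last comment, here is a minimal sketch of the "list the characters to include" approach. The function name extract_context_with and its $extra parameter are hypothetical, and passing the extra characters through preg_quote is a choice made here (it escapes the dash, so its position in the class no longer matters); the file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper (not from the thread): like the pattern in the
// question, but $extra lists additional characters that count as part
// of a word, e.g. "'§@#-".
function extract_context_with(string $text, string $search, string $extra, int $words = 5): ?string
{
    $quoted = str_replace(' ', '\s', preg_quote($search, '/'));

    // preg_quote escapes "-", "]", "^" and "\" so they stay literal
    // inside the character class.
    $word = '[\w\p{Hebrew}' . preg_quote($extra, '/') . ']';

    $pattern = '/(' . $word . '*\S\s+){0,' . $words . '}\S*\b(' . $quoted
             . ')\b\S*(\s\S+){0,' . $words . '}/iu';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$text = "I don't know why, but visit your page now";

// Without the apostrophe in the class, the leading context is cut
// inside "don't":
echo extract_context_with($text, 'visit your', ''), "\n";
// t know why, but visit your page now

// With the apostrophe included, the full five words are returned:
echo extract_context_with($text, 'visit your', "'§@#-"), "\n";
// I don't know why, but visit your page now
```

The cut happens because each leading "word" must consist of class characters followed by exactly one other non-space character and then whitespace, so an apostrophe in the middle of a word stops the match unless it is part of the class.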