What I want is this: let's suppose I searched for "goo" using a query like ...WHERE message LIKE '%goo%', and it returned a result such as "I love Google to make my searches, but I'm starting to worry about privacy". That message is displayed as a result because the word Google matches my search criteria.

Based on my search string, how do I save the entire matched word (Google) in a variable? I need this because I'm using a regular expression that highlights the searched word and displays some content before and after it. Right now it only works when the searched word exactly matches a word in the result. It is also badly constructed, so it doesn't work well with words that are not surrounded by spaces.

This is the regular expression code

<?=preg_replace('/^.*?\s(.{0,'.$size.'})(\b'.$_GET['s'].'\b)(.{0,'.$size.'})\s.*?$/',
            '...$1<strong>$2</strong>$3...',$message);?>

What I want is to replace this $_GET['s'] with a variable that contains the whole word matched by my query.

How do I achieve this?

steps
  • About the `\B*` not working (I've read your discussion), that's normal. `\B` (just like `\b`) matches **a position** (which is not a word boundary). You can repeat a character, but it makes no sense to repeat a position. – Loamhoof Apr 22 '13 at 12:53
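
To illustrate the point about positions, here is a small sketch (the sample string is made up for illustration). `\b` only ever produces empty matches, each at a boundary offset, so there is nothing for a quantifier to repeat.

```php
<?php
// Hypothetical sample string, chosen only for illustration.
$subject = 'foo bar';

// \b consumes no characters: every match is an empty string at some offset.
preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);

$offsets = array();
foreach ($matches[0] as $match) {
    // $match[0] is the matched text (always ''), $match[1] is its byte offset.
    $offsets[] = $match[1];
}

// Prints the offsets of the word boundaries, separated by commas.
echo implode(',', $offsets), "\n";
```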

3 Answers

I think it would be easier to change your regular expression so that it matches any word containing the term. What about:

<?=preg_replace('/^.*?(.{0,'.$size.'})(\b\S*'.$_GET['s'].'\S*\b)(.{0,'.$size.'}).*?$/i',
            '...$1<strong>$2</strong>$3...',$message);?>
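
Since the question asks how to get the whole matched word into a variable, here is a minimal sketch along the same lines. It assumes UTF-8 input. `findWholeWord` is a hypothetical helper name, and `preg_quote` is an addition of this sketch so that regex metacharacters in the search term are treated literally.

```php
<?php
/**
 * Hypothetical helper: returns the first whole word in $message that
 * contains $term (case-insensitively), or null if there is none.
 * Assumes both strings are valid UTF-8.
 */
function findWholeWord($message, $term)
{
    // Escape regex metacharacters (and the '/' delimiter) in the user input.
    $quoted = preg_quote($term, '/');

    // Letters, combining marks and digits on either side of the term.
    // The u modifier makes the pattern UTF-8 aware; i makes it case-insensitive.
    $pattern = '/[\p{L}\p{M}\p{N}]*' . $quoted . '[\p{L}\p{M}\p{N}]*/iu';

    return preg_match($pattern, $message, $matches) === 1 ? $matches[0] : null;
}

$message = "I love Google to make my searches, but I'm starting to worry about privacy";
$word = findWholeWord($message, 'goo');
echo $word, "\n";
```

The returned word could then stand in for $_GET['s'] in the highlighting expression, ideally still passed through `preg_quote` first.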
arraintxo
  • I think that to make it work as expected I would not only have to check any word containing the term but simulate in REGEX all MySQL LIKE capability (case insensitive, special characters, etc.), I don't think I can achieve that, how would that be? – steps Apr 17 '13 at 18:40
  • I changed the expression adding /i modifier to make it case insensitive and replaced \w* with .* to match any character, I think that should (almost) work. – arraintxo Apr 17 '13 at 18:46
  • Things got worse :( now, it's wrapping lots of words, before and after the matched one into the strong tag in some results, and in others, it won't wrap anything – steps Apr 17 '13 at 18:49
  • Try again with my last edit, I made it a lazy match with those question marks (.*?). Hope this time it really works :) – arraintxo Apr 17 '13 at 18:51
  • :( Still not working. With some result strings it works, but with others it wraps more words into the tag and sometimes won't wrap a thing. – steps Apr 17 '13 at 18:56
  • Let's try again taking everything but spaces... :S – arraintxo Apr 17 '13 at 19:19
  • Now it's almost working! But it's still case sensitive (if I search for `peo`, it wraps `people` in tags but not `People`, `PEOPLE` or `peOple`). Also, since we are taking everything but spaces, `people!`, `people.`, `people?`, `"people`, `people"` and variants are not working, nor are special character variations, like looking for `péo`. – steps Apr 17 '13 at 19:24
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/28393/discussion-between-arraintxo-and-joao-paulo-apolinario-passos) – arraintxo Apr 17 '13 at 19:40

I read your discussion on this, and a more robust implementation might be in order, especially taking your need to support diacritics into account. Using a single regular expression to fix all your problems might seem tempting, but the more complicated it becomes, the harder it gets to maintain or expand upon. To quote Jamie Zawinski:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

As I have problems with iconv on my local machine, I used a simpler implementation instead. Feel free to use something more complicated or robust if your situation requires it.

This solution uses a simple regular expression to split the text into runs of alphanumeric characters (also known as "words"). The \p{L}\p{M} part of the character class, together with the u modifier, makes sure multibyte letters and combining marks are treated as part of a word too.

You can see this code working on IDEone.

<?php
function stripAccents($p_sSubject) {
    $sSubject = (string) $p_sSubject;

    $sSubject = str_replace('æ', 'ae', $sSubject);
    $sSubject = str_replace('Æ', 'AE', $sSubject);

    $sSubject = strtr(
          utf8_decode($sSubject)
        , utf8_decode('àáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝ')
        , 'aaaaaaceeeeiiiinoooooouuuuyyAAAAAACEEEEIIIINOOOOOOUUUUY'
    );


    return $sSubject;
}

function emphasiseWord($p_sSubject, $p_sSearchTerm){

    $aSubjects = preg_split('#([^a-z0-9\p{L}\p{M}]+)#iu', $p_sSubject, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach($aSubjects as $t_iKey => $t_sSubject){
        $sSubject = stripAccents($t_sSubject);
        
        if(stripos($sSubject, $p_sSearchTerm) !== false || mb_stripos($t_sSubject, $p_sSearchTerm) !== false){
            $aSubjects[$t_iKey] = '<strong>' . $t_sSubject . '</strong>';
        }
    }

    $sSubject = implode('', $aSubjects);
    
    return $sSubject;
}


/////////////////////////////// Test \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
$aTest = array(
      'goo' => 'I love Google to make my searches, but I`m starting to worry about privacy.'
    , 'peo' => 'people, People, PEOPLE, peOple, people!, people., people?, "people, people" péo'
    , 'péo' => 'people, People, PEOPLE, peOple, people!, people., people?, "people, people" péo'
    , 'gen' => '"gente", "inteligente", "VAGENS", and "Gente" ...vocês da física que passam o dia protegendo...'
    , 'voce' => '...vocês da física que passam o dia protegendo...'
    , 'o' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
    , 'ø' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
    , 'ae' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
    , 'Æ' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
);

$sContent = '<dl>';
foreach($aTest as $t_sSearchTerm => $t_sSubject){
    $sContent .= '<dt>' . $t_sSearchTerm . '</dt><dd>' . emphasiseWord($t_sSubject, $t_sSearchTerm) .'</dd>';
}
$sContent .= '</dl>';

echo $sContent;
?>
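
Note that `utf8_decode()` is deprecated as of PHP 8.2. A minimal sketch of the same idea that avoids it, assuming UTF-8 input and a UTF-8 encoded source file: `strtr()` with an array map replaces whole multibyte sequences, so no conversion to a single-byte encoding is needed. The name `stripAccentsMap` and the shortened map are illustrative only, not a complete transliteration table.

```php
<?php
/**
 * Hypothetical variant of stripAccents() that does not rely on utf8_decode().
 * The map below is deliberately incomplete; extend it for your own data.
 */
function stripAccentsMap($subject)
{
    static $map = array(
        'æ' => 'ae', 'Æ' => 'AE', 'ø' => 'o', 'Ø' => 'O',
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'ç' => 'c', 'Ç' => 'C', 'ñ' => 'n', 'Ñ' => 'N',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
    );

    // With an array argument, strtr() matches keys as byte sequences and
    // prefers the longest key, so multibyte UTF-8 keys are replaced whole.
    return strtr((string) $subject, $map);
}

echo stripAccentsMap('vocês da física, péo, Æø'), "\n";
```

Unlike the original function, this variant leaves unmapped characters as valid UTF-8 instead of converting them to ISO-8859-1.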
Potherca
  • The `utf8_decode` mapping you are using does not include æøåÆØÅ, which are used in Denmark, Sweden and Norway. Wouldn't that potentially prove to be an issue? – melwil May 03 '13 at 13:16
  • @melwil Yes, it would prove an issue, hence my recommendation to the reader to use a more complicated or robust implementation if the situation requires it. If you have a stable development environment (meaning one that mirrors your production servers) you could just use `iconv` to translate the characters. Otherwise you would need to [adjust the code to your situation](http://stackoverflow.com/questions/4491937/converting-to-ae-in-php-with-str-replace). For the fun of it (and to give you a more exact example) I have updated my answer to include the characters you mentioned. – Potherca May 03 '13 at 17:14

I don't understand the importance of matching everything else in the search string. Wouldn't this be enough?

<?=preg_replace('/\b\S*'.$_GET['s'].'\S*\b/i', '<strong>$0</strong>', $message);?>

As far as I can tell, you are only wrapping the matched word in an HTML tag and not doing anything to the rest of the string?

The above regex wraps every whole word that contains the search term, handles multiple matches within a string (should there be more than one), and is case-insensitive.
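
A minimal sketch of this approach wrapped in a function, with two additions that are assumptions of the sketch rather than part of the answer: `preg_quote`, so that characters such as `.` or `/` in the search term are matched literally, and the `u` modifier, so that UTF-8 text is handled correctly. `highlightTerm` is a hypothetical name.

```php
<?php
/**
 * Hypothetical helper: wraps every whole word containing $term in <strong>.
 * Assumes $message and $term are valid UTF-8.
 */
function highlightTerm($message, $term)
{
    // Treat the user-supplied term as literal text inside the pattern.
    $quoted = preg_quote($term, '/');

    // $0 in the replacement refers to the whole match, so no capture
    // group is needed around the word.
    return preg_replace(
        '/\b\S*' . $quoted . '\S*\b/iu',
        '<strong>$0</strong>',
        $message
    );
}

echo highlightTerm('people, People, PEOPLE! "people"', 'peo'), "\n";
```

In a real page, both the message and the search term should also be HTML-escaped before output; that step is left out here to keep the sketch short.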

melwil
  • I think the wrapping parentheses are not that useful (you can use `$0` in PHP if I remember correctly). But yeah, seems the easiest solution. – Loamhoof Apr 22 '13 at 12:51
  • @Loamhoof Yes, you are right. It's simply a remnant from my simplification. – melwil Apr 22 '13 at 19:57