
I am trying to group words of 4 or more characters with words of 3 or fewer characters using preg_match_all() in PHP. This is for a keyword search function where users can enter things like "An elephant", and I cannot have results come back that match only "An".

Therefore, instead of breaking the keywords apart at spaces (e.g. "An", "elephant"), I need to attach keywords of three or fewer characters to the next or previous keyword (e.g. "An elephant", "History of").

To accomplish this I am trying to use conditional subpatterns, but I am not sure I am on the right track.

Here's the best I've got so far:

(\s\S{1,3}\s*)?(?(1)\S+)

Yet I also seem to be matching a whole bunch of empty spaces. Can someone please point me in the right direction?

In the case of "History of elephants" I am trying to get it to create two matches: "History of", and "elephants".

I cannot simply omit the "stop words" because they are important in this case. The real-life use case is searching for course titles such as "Calculus A" and in that case "A" is important.
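
For reference, a minimal sketch that reproduces the empty matches. When the optional group 1 does not participate, the conditional `(?(1)\S+)` has no "else" branch and matches nothing, so the whole pattern can succeed with a zero-length match. The sample input and variable names are only illustrative.

```php
<?php
// Reproduce the empty matches produced by the pattern in the question.
$pattern = '/(\s\S{1,3}\s*)?(?(1)\S+)/';
$input   = 'An elephant';

preg_match_all($pattern, $input, $found);

// $found[0] holds the full matches. Some of them are empty strings,
// because the pattern may match nothing when group 1 is skipped.
$empty = array_filter($found[0], function ($m) {
    return $m === '';
});

print_r($found[0]);
echo count($empty), " empty match(es)\n";
```

Note that group 1 also requires a leading whitespace character, so the short word at the very start of the input is never picked up by it.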

Chris Bier
  • What should happen with "history of elephants"? – Fabian Schmengler Feb 07 '15 at 20:38
  • Ideally two matches "history of", and "elephants" – Chris Bier Feb 07 '15 at 20:40
  • Looking into using `preg_split` I'm starting to think that might be a better solution – Chris Bier Feb 07 '15 at 20:45
  • Usually [stop words](http://en.wikipedia.org/wiki/Stop_words) like `a`, `an`, `of`, `by`, `as`, `at`[...](http://www.textfixer.com/resources/common-english-words.txt) are removed to improve performance and accuracy of a search function. If it's needed to find those, several indexes can be used for searching. – Jonny 5 Feb 07 '15 at 21:21
  • @Jonny5 the real use case is searching for a course like "Calculus A" in which case "A" is very important. – Chris Bier Feb 07 '15 at 21:26

2 Answers


See if this would match your needs:

\b(?:[\w'-]{1,3}\W+[\w'-]{4,}|[\w'-]{4,}\W+[\w'-]{1,3}|[\w'-]{4,})\b
  • Starts and ends at \b word boundaries.
  • [\w'-]{1,3}\W+[\w'-]{4,} matches a word of 1-3 characters, followed by \W+ (one or more non-word characters), followed by [\w'-]{4,} (a word of 4 or more characters).
  • |[\w'-]{4,}\W+[\w'-]{1,3} or matches a word of 4+ characters followed by a shorter one.
  • |[\w'-]{4,} or matches any single word of at least 4 characters (lower the minimum if needed).

Test at regex101.com; Regex FAQ

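Also note the problem with input such as "I visted Calculus A, you in Calculus B?". It outputs: I visted, Calculus A, in Calculus. The first alternative (short word followed by long word) takes priority, so "in" grabs the second "Calculus" and the trailing "B" is left unmatched.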
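
A quick way to check that behaviour, as a minimal sketch using the same pattern written on one line (the variable names are only illustrative):

```php
<?php
// Same pattern as above, without the x flag.
$pattern = '~\b(?:[\w\'-]{1,3}\W+[\w\'-]{4,}|[\w\'-]{4,}\W+[\w\'-]{1,3}|[\w\'-]{4,})\b~';
$input   = 'I visted Calculus A, you in Calculus B?';

preg_match_all($pattern, $input, $out);

// "you" and "B" end up in no match: "in" has already consumed
// the second "Calculus" via the first alternative.
print_r($out[0]);
// Array ( [0] => I visted [1] => Calculus A [2] => in Calculus )
```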


And a PHP example ($out[0] holds the matches):

$str = "
An elephant in the garden 
history of elephants
Algebra A B-movies";

$pattern = '~\b(?:
[\w\'-]{1,3}\W+[\w\'-]{4,}|
[\w\'-]{4,}\W+[\w\'-]{1,3}|
[\w\'-]{4,}
)\b~x';

if(preg_match_all($pattern, $str, $out)) {
  print_r($out[0]);
}

outputs to:

Array
(
    [0] => An elephant
    [1] => the garden
    [2] => history of
    [3] => elephants
    [4] => Algebra A
    [5] => B-movies
)


Jonny 5
  • Jonny thank you so much. I actually just found a solution myself. `/((\b\w{1,3}\s)+\w{4,})|(\w{4,}(\s\w{1,3}\b))|(\w{4,})/i` is the regex that I used and it is very similar to yours. – Chris Bier Feb 07 '15 at 22:34
  • In terms of your edit "I visted Calculus A, you in Calculus B?" That is quite alright in this present application because the searches are very fragmented. e.g. "History of American Science" will output "History of", "American", "Science" – Chris Bier Feb 07 '15 at 22:40
  • Thanks again for your persistence! – Chris Bier Feb 07 '15 at 22:40
  • I found that very interesting, you're welcome @ChrisB. Also your pattern seems to do the job fine. I was not sure if this is for a large text or phrase :) – Jonny 5 Feb 07 '15 at 22:41
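
The pattern from the comments above can be tried against the inputs mentioned in this thread with a short sketch like the following. The `$results` array and the loop are only illustrative scaffolding, not part of the asker's code.

```php
<?php
// Pattern quoted in the comment above.
$pattern = '/((\b\w{1,3}\s)+\w{4,})|(\w{4,}(\s\w{1,3}\b))|(\w{4,})/i';

$results = [];
foreach (['History of American Science', 'Calculus A'] as $query) {
    preg_match_all($pattern, $query, $m);
    $results[$query] = $m[0]; // full matches only
}

print_r($results);
// 'History of American Science' => History of, American, Science
// 'Calculus A'                  => Calculus A
```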

There are some complications with what you're trying to do, because it gives rise to ambiguities. Is History of elephants [History of] [elephants] or [History] [of elephants]? You're probably better off just excluding a set of specific stop words, or words that meet some criteria.

If you want to exclude words of 3 or less characters, you might try the following. You say you're already splitting the keywords at spaces, so you should have an array of words. You can just array_filter that array based on word length (> 3 chars), and you should have the list of words you want to use.

$words = array('no', 'na', 'sure', 'definitely');

function length_filter($word) {
    return mb_strlen($word) > 3;
}

$longer_than_3 = array_filter($words, 'length_filter');
print_r($longer_than_3);

// Array
// (
//     [2] => sure
//     [3] => definitely
// )
Jon Surrell
  • Thanks for the answer, but I am trying to keep the words of 3 or less characters. I just need to merge them into the surrounding words of 4 or more characters. – Chris Bier Feb 07 '15 at 21:25
  • "History of elephants" can be split either way. I just need to group them in some way to prevent the search function from searching for all instances of "of". But I cannot omit "of" in this particular use case because it is valuable. The real use case is searching for course titles such as "Calculus A" – Chris Bier Feb 07 '15 at 21:28
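
Since the short words have to be kept and merged rather than filtered out, one non-regex possibility is to split at whitespace and glue each short word onto a neighbouring group. This is only a sketch under assumptions: `group_keywords` is a hypothetical helper, and the rule "attach a short word to the previous group if there is one, otherwise to the next long word" is just one way to resolve the ambiguity discussed above.

```php
<?php
// Hypothetical helper: merge words shorter than $minLength into a neighbouring group.
function group_keywords(string $input, int $minLength = 4): array
{
    $words   = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    $groups  = [];
    $pending = []; // leading short words waiting for the first long word

    foreach ($words as $word) {
        // strlen() counts bytes; swap in mb_strlen() for multibyte input.
        $isShort = strlen($word) < $minLength;

        if ($isShort && $groups !== []) {
            // Attach the short word to the previous group.
            $groups[count($groups) - 1] .= ' ' . $word;
        } elseif ($isShort) {
            $pending[] = $word;
        } else {
            $pending[] = $word;
            $groups[]  = implode(' ', $pending);
            $pending   = [];
        }
    }

    // Input that consists only of short words becomes a single group.
    if ($pending !== []) {
        $groups[] = implode(' ', $pending);
    }

    return $groups;
}

print_r(group_keywords('History of elephants')); // History of, elephants
print_r(group_keywords('An elephant'));          // An elephant
print_r(group_keywords('Calculus A'));           // Calculus A
```

Unlike the regex approaches, this sketch can chain several short words onto one group, so "An elephant in the garden" would yield "An elephant in the" and "garden".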