1

How can I count the number of words between two words?

   $txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
    cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
    tükörfúrógép banana orange lime
    orange lime cat árvíztűrő";

The two words are 'árvíztűrő' and 'tükörfúrógép'.
I need it to return:
tükörfúrógép cherry árvíztűrő
tükörfúrógép cat orange lime cat árvíztűrő
tükörfúrógép banana orange lime orange lime cat árvíztűrő

Now I have this regular expression:

preg_match_all('@((tükörfúrógép(.*)?árvíztűrő)(árvíztűrő(.*)?tükörfúrógép))@sui',$txt,$m);
Gumbo
turbod

3 Answers

7

I have several things to point out:

  1. You can't do it in one regex. Regex matching is forward-only, so the reversed match order requires a second regex.
  2. You use (.*)?, but you mean (.*?)
  3. To acquire correct matches, you must ensure that the left boundary of your expression cannot occur in the middle of a match.
  4. You should denote word boundaries (\b) around your delimiter words to ensure whole-word matches. EDIT: While this is correct in theory, it does not work for Unicode input in PHP.
  5. You should switch the PHP locale to Hungarian (it is Hungarian, right?) before calling preg_match_all(), because the locale has an influence on what's considered a word boundary in PHP. EDIT: The meaning of \b does in fact not change with the selected locale.
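Point 2 is easy to check on a small ASCII sample. The string and the words "start" and "end" below are made up for illustration and are not from the question. (.*)? is a greedy group that has merely been made optional, so it still runs to the last possible "end"; (.*?) is lazy and stops at the first one.

```php
<?php
// Illustrative sample, not taken from the question.
$sample = 'start a end b end';

// (.*)? : greedy group made optional; it still grabs as much as possible.
preg_match('/start(.*)?end/', $sample, $greedy);

// (.*?) : lazy group; it grabs as little as possible.
preg_match('/start(.*?)end/', $sample, $lazy);

echo $greedy[0], "\n"; // start a end b end
echo $lazy[0], "\n";   // start a end
```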

That being said, regex #1 is:

(\btükörfúrógép\b)((?:(?!\1).)*?)\bárvíztűrő\b

and regex #2 is analogous, just with reversed delimiter words.

Regex explanation:

(               # match group 1:
  \b            #   a word boundary
  tükörfúrógép  #   your first delimiter word
  \b            #   a word boundary
)               # end match group 1
(               # match group 2:
  (?:           #   non-capturing group:
    (?!         #     look-ahead:
      \1        #       must not be followed by delimiter word 1
    )           #     end look-ahead
    .           #     match any next char (includes \n with the "s" switch)
  )*?           #   end non-capturing group, repeat as often as necessary
)               # end match group 2 (this is the one you look for)
\b              # a word boundary
árvíztűrő       # your second delimiter word
\b              # a word boundary
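The effect of the look-ahead inside group 2 can be seen on a small ASCII sample, where \b is unproblematic. This is only a sketch: "foo" and "bar" are placeholders standing in for the two delimiter words.

```php
<?php
// Illustrative ASCII sample; "foo" and "bar" stand in for the delimiter words.
$sample = 'foo x foo y bar';

// Plain lazy dot: the match starts at the FIRST "foo" and runs over the second one.
preg_match('/\bfoo\b(.*?)\bbar\b/s', $sample, $plain);

// Look-ahead before every dot: no character of group 2 may begin another "foo",
// so the match can only start at the LAST "foo" before "bar".
preg_match('/(\bfoo\b)((?:(?!\1).)*?)\bbar\b/s', $sample, $tempered);

echo $plain[0], "\n";    // foo x foo y bar
echo $tempered[0], "\n"; // foo y bar
```

Group 2 of the second match holds " y ", the text between the two delimiters.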

UPDATE: With PHP's poor Unicode string support, you will be forced to use expressions like these as replacements for \b:

$before = '(?<=^|[^\p{L}])';
$after  = '(?=[^\p{L}]|$)';

This suggestion has been taken from another question.
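Put together, the pieces might look like the sketch below. The function name between_words is hypothetical, and the "u" modifier assumes the pattern and the input are both UTF-8. Calling it once per direction corresponds to regex #1 and regex #2.

```php
<?php
// Hypothetical helper (not from the answer): finds every stretch that starts
// with $from and ends with $to, with no further $from in between.
function between_words($text, $from, $to)
{
    $before = '(?<=^|[^\p{L}])'; // start of string or a non-letter behind
    $after  = '(?=[^\p{L}]|$)';  // a non-letter or end of string ahead

    $pattern = '/(' . $before . preg_quote($from, '/') . $after . ')'
             . '((?:(?!\1).)*?)'
             . $before . preg_quote($to, '/') . $after . '/su';

    preg_match_all($pattern, $text, $m);

    // Collapse line breaks and indentation so each match reads as one line.
    return preg_replace('/\s+/u', ' ', $m[0]);
}

$txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
    cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
    tükörfúrógép banana orange lime
    orange lime cat árvíztűrő";

print_r(between_words($txt, 'tükörfúrógép', 'árvíztűrő')); // regex #1
print_r(between_words($txt, 'árvíztűrő', 'tükörfúrógép')); // regex #2
```

On the question's text, the first call should return the three lines the question asks for, and the second call the two reversed matches.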

Tomalak
  • This returns an empty array: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) ) – turbod Jul 21 '10 at 07:39
  • PS: Well, to be completely honest - you *can* do it in one regex, by concatenating regex #1 and regex #2 like this `#1|#2`. It's up to you if you consider the resulting expression worthwhile. ;-) – Tomalak Jul 21 '10 at 07:43
  • @turbod: What does a simple `\bárvíztűrő\b` give you? – Tomalak Jul 21 '10 at 07:45
  • I'm currently researching the way `\b` works with PHP PCRE and Unicode strings. Looks like the locale does *not* have an influence, and an alternative must be used for "international" word boundaries. When I find something, I'll update my answer. – Tomalak Jul 21 '10 at 07:53
  • setLocale(LC_ALL, 'hu_HU.utf8'); preg_match_all('@\bárvíztűrő\b@',$txt,$m); print_r($m); This returns an empty array. – turbod Jul 21 '10 at 07:54
  • @turbod: Yeah, as I said that's because `\b` does not change meaning based on the locale. Take out all `\b` and try again. – Tomalak Jul 21 '10 at 08:00
  • 1
    Thanks Tomalak! This expression is work! ((?<!\pL)tükörfúrógép(?!\pL))((?:(?!\1).)*?)(?<!\pL)árvíztűrő(?!\pL)|((?<!\pL)árvíztűrő(?!\pL))((?:(?!\1).)*?)(?<!\pL)tükörfúrógép(?!\pL) – turbod Jul 21 '10 at 08:02
  • @turbod: Your look-around for Unicode letters is *almost* correct - it does not account for start-of-string and end-of-string conditions. See my update. – Tomalak Jul 21 '10 at 08:10
3

To count words between two words you can easily use:

count(explode(" ", "lime orange banana")); // split() was removed in PHP 7
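Splitting on a single space also counts the empty pieces produced by repeated spaces, and it counts an empty string as one word. A hedged alternative, assuming that a "word" means a run of Unicode letters and that the input is UTF-8, is to count \p{L}+ matches. The function name count_words is hypothetical.

```php
<?php
// Assumption: a "word" is a maximal run of Unicode letters (\p{L}).
function count_words($text)
{
    // preg_match_all() returns the number of full matches (or false on error).
    return (int) preg_match_all('/\p{L}+/u', $text);
}

echo count_words('lime orange banana'), "\n"; // 3
echo count_words('lime,  árvíztűrő'), "\n";   // 2
echo count_words(''), "\n";                   // 0
```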

And a function that returns an array with matches and counts will be:

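function count_between_words($text, $first, $second, $case_sensitive = false)
{
    // Quote the delimiters so regex metacharacters in them are matched literally.
    $first  = preg_quote($first, '/');
    $second = preg_quote($second, '/');
    // "s": the dot also matches newlines; "u": treat pattern and subject as UTF-8.
    $flags  = 'su' . ($case_sensitive ? '' : 'i');
    // Collapse whitespace runs so that matches spanning several lines read as one line.
    $text   = preg_replace('/\s+/u', ' ', $text);

    if(!preg_match_all('/(' . $first . ')((?:(?!\1).)*?)' . $second . '/' . $flags, $text, $results, PREG_SET_ORDER))
        return array();

    $data = array();

    foreach($results as $result)
    {
        $result[2] = trim($result[2]);
        // split() was removed in PHP 7; PREG_SPLIT_NO_EMPTY also gives 0 for an empty gap.
        $words  = preg_split('/\s+/u', $result[2], -1, PREG_SPLIT_NO_EMPTY);
        $data[] = array("match" => $result[0], "words" => $result[2], "count" => count($words));
    }

    return $data;
}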

$result = count_between_words($txt, "tükörfúrógép", "árvíztűrő");

echo "<pre>" . print_r($result, true) . "</pre>";

Result will be:

Array
(
    [0] => Array
    (
        [match] => tükörfúrógép cherry árvíztűrő
        [words] => cherry
        [count] => 1
    )

    [1] => Array
    (
        [match] => tükörfúrógép cat orange lime cat árvíztűrő
        [words] => cat orange lime cat
        [count] => 4
    )

    [2] => Array
    (
        [match] => tükörfúrógép banana orange lime orange lime cat árvíztűrő
        [words] => banana orange lime orange lime cat
        [count] => 6
    )
)
Wiliam
  • Thanks William! It's great! But what happens if you reverse the order of the parameters? For example: $result = count_between_words($txt, "árvíztűrő","tükörfúrógép"); – turbod Jul 21 '10 at 08:09
  • Searching in reverse is not a logic error, it is a completely different search. Why? :o – Wiliam Jul 21 '10 at 08:17
  • +1 for providing a self-contained solution. The regex however needs some improvement because it makes assumptions that may or may not be true (namely: `\s*` and `[^,]+?`) and can produce false negatives because of this. – Tomalak Jul 21 '10 at 08:25
  • Reverse will return: " árvíztűrő orange lyon cat lime mac tükörfúrógép" (5) and "árvíztűrő tükörfúrógép" (0) – Wiliam Jul 21 '10 at 08:25
  • Tomalak, I used \s* to trim the result contained in ([^,]+?), but you are right. Given the example he gave us, and assuming a normal human-written post, this will be OK, and errors can be easily fixed. The same goes for [^,]: in human-written text a comma separates clauses, and if you don't use it in this example it will return a false positive. (Ah! Thanks for the point!) – Wiliam Jul 21 '10 at 08:29
  • I think that assuming that a comma is a significant delimiter in a complex, human-produced text is putting too much faith in the grammatical abilities of the average human. ;-) The question stated "between these two words", and as long as the definition is not more precise, I would refrain from making assumptions about the nature of the input. :-) *(PS: This site uses a Twitter style @-reply system. Unless you use it, your comment might go unnoticed by the one you are talking to.)* – Tomalak Jul 21 '10 at 08:38
  • @Tomalak, yes, I saw that after my last comment. In response to your comment, you are right again. I improved the function with your regex, and I learned today what (?!) does in a regex :D – Wiliam Jul 21 '10 at 08:55
  • @turbod, why do you want to reverse it? – Wiliam Jul 21 '10 at 08:57
  • First day, already learned something. Good start. :-) *(PS, again: Check out http://meta.stackexchange.com/questions/38600/ for a way to make comment replies easy.)* – Tomalak Jul 21 '10 at 08:57
  • @Tomalak: OK, I installed the fast reply script, I need it hehe – Wiliam Aug 20 '10 at 10:49
1

Instead of a huge, confusing regexp, why not write a few lines using various string functions?

Example:

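$first  = 'árvíztűrő';
$second = 'tükörfúrógép';
// strpos()/substr() work on bytes, so use strlen() rather than a hard-coded 9:
// in UTF-8, 'árvíztűrő' is 13 bytes long.
$start = strpos($txt, $first) + strlen($first); // offset of first byte after 'árvíztűrő'
$end   = strpos($txt, $second, $start);
$inner = substr($txt, $start, $end - $start);
$words = preg_split('/[\s,]+/u', $inner, -1, PREG_SPLIT_NO_EMPTY); // drop empty pieces at the edges
$num   = count($words);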

Of course, this will eat up memory if you have some gigantic input string...
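A slightly more defensive sketch of the same approach follows. strpos() returns false when a word is missing, and its offsets are byte offsets, so strlen() supplies the length of the first word. The function name words_between is hypothetical, and unlike the regex answers this does not check that the delimiters are whole words.

```php
<?php
// Hypothetical wrapper around the same idea; returns null when either word is missing.
function words_between($txt, $first, $second)
{
    $pos = strpos($txt, $first);
    if ($pos === false) {
        return null;
    }
    $start = $pos + strlen($first);            // byte offset just past $first
    $end   = strpos($txt, $second, $start);
    if ($end === false) {
        return null;
    }
    $inner = substr($txt, $start, $end - $start);
    // Split on whitespace and commas, dropping empty pieces at the edges.
    return preg_split('/[\s,]+/u', $inner, -1, PREG_SPLIT_NO_EMPTY);
}

$words = words_between("a árvíztűrő orange lyon\n cat tükörfúrógép b", 'árvíztűrő', 'tükörfúrógép');
echo count($words), "\n"; // 3
print_r($words);
```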

Kricket
  • Ah - what did it do? Looking at it now, a possible problem that comes to mind is that your funny accented characters probably aren't in the ASCII set and so the length of 'árvíztűrő' may be more than 9... – Kricket Jul 21 '10 at 10:11