How can I count the number of words between two words?

Question

   $txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
    cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
    tükörfúrógép banana orange lime
    orange lime cat árvíztűrő";

The two words are 'árvíztűrő' and 'tükörfúrógép'.
I need this to be returned:
tükörfúrógép cherry árvíztűrő
tükörfúrógép cat orange lime cat árvíztűrő
tükörfúrógép banana orange lime orange lime cat árvíztűrő

This is the regular expression I have now:

preg_match_all('@((tükörfúrógép(.*)?árvíztűrő)(árvíztűrő(.*)?tükörfúrógép))@sui',$txt,$m);

score 7 · Accepted Answer · edited May 23 '17 at 10:27

I have several things to point out:

You can't do it in one regex. Regex matching is forward-only, so matching the two words in reversed order requires a second regex.
You use (.*)?, but you mean (.*?).
To acquire correct matches, you must ensure that the left boundary of your expression cannot occur in the middle of a match.
~~You should denote word boundaries (\b) around your delimiter words to ensure whole-word matches.~~ EDIT: While this is correct in theory, it does not work for Unicode input in PHP.
~~You should switch the PHP locale to Hungarian (it is Hungarian, right?) before calling preg_match_all(), because the locale has an influence on what's considered a word boundary in PHP.~~ EDIT: The meaning of \b does in fact not change with the selected locale.
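A small sketch of the difference between (.*)? and (.*?) from the second point. The sample string is made up and plain ASCII, so Unicode handling does not interfere:

```php
<?php
$s = 'a1b a2b';

// (.*)? is a greedy group that is merely optional: it still grabs as much as it can.
preg_match('/a(.*)?b/', $s, $greedy);

// (.*?) is a lazy quantifier: it stops at the first position where the rest can match.
preg_match('/a(.*?)b/', $s, $lazy);

echo $greedy[0], "\n"; // a1b a2b
echo $lazy[0], "\n";   // a1b
```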

That being said, regex #1 is:

(\btükörfúrógép\b)((?:(?!\1).)*?)\bárvíztűrő\b

and regex #2 is analogous, just with reversed delimiter words.

Regex explanation:

(               # match group 1:
  \b            #   a word boundary
  tükörfúrógép  #   your first delimiter word
  \b            #   a word boundary
)               # end match group 1
(               # match group 2:
  (?:           #   non-capturing group:
    (?!         #     look-ahead:
      \1        #       must not be followed by delimiter word 1
    )           #     end look-ahead
    .           #     match any next char (includes \n with the "s" switch)
  )*?           #   end non-capturing group, repeat as often as necessary
)               # end match group 2 (this is the one you look for)
\b              # a word boundary
árvíztűrő       # your second delimiter word
\b              # a word boundary
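To see what the (?:(?!\1).)*? part buys you, here is a minimal sketch on a made-up ASCII string (the words "start" and "end" are placeholders, not from the question). A plain lazy .*? still starts at the leftmost delimiter and runs across a second one, while the guarded version cannot:

```php
<?php
$s = 'start one start two end';

// Plain lazy match: begins at the first "start" and crosses the second one.
preg_match('/(start)(.*?)end/s', $s, $plain);

// Guarded match: every consumed character is checked to not begin another "start".
preg_match('/(start)((?:(?!\1).)*?)end/s', $s, $tempered);

echo '[', $plain[2], "]\n";    // [ one start two ]
echo '[', $tempered[2], "]\n"; // [ two ]
```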

UPDATE: With PHP's ~~pathetic~~ poor Unicode string support, you will be forced to use expressions like these as replacements for \b:

$before = '(?<=^|[^\p{L}])';
$after  = '(?=[^\p{L}]|$)';

This suggestion has been taken from another question.
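Putting the pieces together, a sketch of how regex #1 and its reversed twin could be run against the question's text, with $before and $after standing in for \b. The helper name words_between() and the whitespace clean-up are my own additions for illustration, not part of the original answer:

```php
<?php
$txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
    cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
    tükörfúrógép banana orange lime
    orange lime cat árvíztűrő";

// Hypothetical helper: returns the text found between $left and $right,
// never crossing another occurrence of $left.
function words_between($txt, $left, $right)
{
    $before = '(?<=^|[^\p{L}])'; // Unicode-aware stand-in for a leading \b
    $after  = '(?=[^\p{L}]|$)';  // Unicode-aware stand-in for a trailing \b

    $pattern = '@(' . $before . preg_quote($left, '@') . $after . ')'
             . '((?:(?!\1).)*?)'
             . $before . preg_quote($right, '@') . $after . '@su';

    preg_match_all($pattern, $txt, $m);

    // Collapse whitespace (including newlines) in each captured gap.
    return array_map(function ($gap) {
        return trim(preg_replace('/\s+/u', ' ', $gap));
    }, $m[2]);
}

// Two separate passes, one per direction.
$forward  = words_between($txt, 'tükörfúrógép', 'árvíztűrő');
$backward = words_between($txt, 'árvíztűrő', 'tükörfúrógép');

print_r($forward);  // cherry / cat orange lime cat / banana orange lime orange lime cat
print_r($backward); // orange lyon cat lime mac / (empty string)
```

Running the two directions separately also returns matches that overlap each other, which a single left-to-right pass cannot do.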

edited May 23 '17 at 10:27 by Community · answered Jul 21 '10 at 07:24 by Tomalak

This returns an empty array: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) ) – turbod Jul 21 '10 at 07:39
PS: Well, to be completely honest - you *can* do it in one regex, by concatenating regex #1 and regex #2 like this `#1|#2`. It's up to you if you consider the resulting expression worthwhile. ;-) – Tomalak Jul 21 '10 at 07:43
@turbod: What does a simple `\bárvíztűrő\b` give you? – Tomalak Jul 21 '10 at 07:45
I'm currently researching the way `\b` works with PHP PCRE and Unicode strings. Looks like the locale does *not* have an influence, and an alternative must be used for "international" word boundaries. When I find something, I'll update my answer. – Tomalak Jul 21 '10 at 07:53
setLocale(LC_ALL, 'hu_HU.utf8'); preg_match_all('@\bárvíztűrő\b@',$txt,$m); print_r($m); This returns an empty array. – turbod Jul 21 '10 at 07:54
@turbod: Yeah, as I said that's because `\b` does not change meaning based on the locale. Take out all `\b` and try again. – Tomalak Jul 21 '10 at 08:00
Thanks Tomalak! This expression works! ((?<!\pL)tükörfúrógép(?!\pL))((?:(?!\1).)*?)(?<!\pL)árvíztűrő(?!\pL)|((?<!\pL)árvíztűrő(?!\pL))((?:(?!\1).)*?)(?<!\pL)tükörfúrógép(?!\pL) – turbod Jul 21 '10 at 08:02
@turbod: Your look-around for Unicode letters is *almost* correct - it does not account for start-of-string and end-of-string conditions. See my update. – Tomalak Jul 21 '10 at 08:10
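One thing to watch when gluing the two expressions together with | as in the comments above: capture groups are numbered across the whole pattern, so the delimiter of the second alternative is group 3, not group 1. As far as I can tell, a \1 left in the second half refers to a group that did not take part in the match, and PCRE treats such a backreference as "never matches", so the (?!\1) guard silently stops guarding. A sketch on a made-up ASCII string (A and B are placeholder delimiters):

```php
<?php
$s = 'B x B y A';

// Second alternative still says \1, but its delimiter is group 3.
preg_match('/(A)((?:(?!\1).)*?)B|(B)((?:(?!\1).)*?)A/s', $s, $unguarded);

// Renumbered: the second alternative refers to its own delimiter.
preg_match('/(A)((?:(?!\1).)*?)B|(B)((?:(?!\3).)*?)A/s', $s, $guarded);

echo $unguarded[0], "\n"; // B x B y A
echo $guarded[0], "\n";   // B y A
```

This only matters for inputs where the second alternative would otherwise run across a repeated delimiter, which is probably why it is easy to miss.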

Wiliam · Answer 2 · 2010-07-21T08:54:08.950

3

To count words between two words you can easily use:

count(explode(" ", "lime orange banana")); // split() was removed in PHP 7; explode() works for a literal separator

A function that returns an array with matches and counts could look like this:

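function count_between_words($text, $first, $second, $case_sensitive = false)
{
    // Quote the delimiters so regex metacharacters in them are matched literally;
    // the "u" flag makes the pattern treat the input as UTF-8.
    $pattern = '/(' . preg_quote($first, '/') . ')((?:(?!\\1).)*?)' . preg_quote($second, '/') . '/su'
             . ($case_sensitive ? '' : 'i');

    // Collapse whitespace runs (including newlines) to single spaces before matching.
    if (!preg_match_all($pattern, preg_replace('/\s+/u', ' ', $text), $results, PREG_SET_ORDER))
        return array();

    $data = array();

    foreach ($results as $result)
    {
        $words = trim($result[2]);
        // split() was removed in PHP 7; PREG_SPLIT_NO_EMPTY makes an empty gap count as 0 words.
        $count = count(preg_split('/\s+/u', $words, -1, PREG_SPLIT_NO_EMPTY));
        $data[] = array("match" => $result[0], "words" => $words, "count" => $count);
    }

    return $data;
}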

$result = count_between_words($txt, "tükörfúrógép", "árvíztűrő");

echo "<pre>" . print_r($result, true) . "</pre>";

Result will be:

Array
(
    [0] => Array
    (
        [match] => tükörfúrógép cherry árvíztűrő
        [words] => cherry
        [count] => 1
    )

    [1] => Array
    (
        [match] => tükörfúrógép cat orange lime cat árvíztűrő
        [words] => cat orange lime cat
        [count] => 4
    )

    [2] => Array
    (
        [match] => tükörfúrógép banana orange lime orange lime cat árvíztűrő
        [words] => banana orange lime orange lime cat
        [count] => 6
    )
)

edited Jul 21 '10 at 08:54 · answered Jul 21 '10 at 08:00 by Wiliam

Thanks William! It's great! But what happens if you reverse the order of the parameters? For example: $result = count_between_words($txt, "árvíztűrő","tükörfúrógép"); – turbod Jul 21 '10 at 08:09
Searching in reverse is not a logic error, it's a completely different search. Why? :o – Wiliam Jul 21 '10 at 08:17
+1 for providing a self-contained solution. The regex however needs some improvement because it makes assumptions that may or may not be true (namely: `\s*` and `[^,]+?`) and can produce false negatives because of this. – Tomalak Jul 21 '10 at 08:25
Reverse will return: " árvíztűrő orange lyon cat lime mac tükörfúrógép" (5) and "árvíztűrő tükörfúrógép" (0) – Wiliam Jul 21 '10 at 08:25
Tomalak, I used \s* to trim the result captured by ([^,]+?), but you are right. Given the example he gave us, and assuming a normal human-written post, this will be OK, and errors can be easily fixed. The same goes for [^,]: in human-written text a comma separates clauses, and if you don't use it, this example will return a false positive. (Ah! Thanks for the point!) – Wiliam Jul 21 '10 at 08:29
I think that assuming that a comma is a significant delimiter in a complex, human-produced text is putting too much faith in the grammatical abilities of the average human. ;-) The question stated "between these two words", and as long as the definition is not more precise, I would refrain from making assumptions about the nature of the input. :-) *(PS: This site uses a Twitter style @-reply system. Unless you use it, your comment might go unnoticed by the one you are talking to.)* – Tomalak Jul 21 '10 at 08:38
@Tomalak, yes, I saw that after my last comment. In response to your comment, you are right again. I improved the function with your regex, and I learned today what (?!) does in regex :D – Wiliam Jul 21 '10 at 08:55
@turbod, why do you want to reverse it? – Wiliam Jul 21 '10 at 08:57
First day, already learned something. Good start. :-) *(PS, again: Check out http://meta.stackexchange.com/questions/38600/ for a way to make comment replies easy.)* – Tomalak Jul 21 '10 at 08:57
@Tomalak: Ok, I installed the fast reply script, I need it hehe – Wiliam Aug 20 '10 at 10:49

score 1 · Answer 3 · answered Jul 21 '10 at 07:35

Instead of a huge, confusing regexp, why not write a few lines using various string functions?

Example:

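$first = 'árvíztűrő';
// strpos() returns a byte offset, so advance by the byte length (strlen), not by a hard-coded 9
$start = strpos($txt, $first) + strlen($first); // position of first byte after 'árvíztűrő'
$end   = strpos($txt, 'tükörfúrógép', $start);
$inner = substr($txt, $start, $end - $start);
$words = preg_split('/[\s,]+/u', $inner, -1, PREG_SPLIT_NO_EMPTY); // drop empty pieces at the edges
$num   = count($words);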

Of course, this will eat up memory if you have some gigantic input string...

answered Jul 21 '10 at 07:35 by Kricket

Ah - what did it do? Looking at it now, a possible problem that comes to mind is that your funny accented characters probably aren't in the ASCII set and so the length of 'árvíztűrő' may be more than 9... – Kricket Jul 21 '10 at 10:11
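The length concern from the comment above can be checked directly. This is a sketch that assumes the source file is saved as UTF-8 with precomposed accented characters; the sample string is made up, and the mb_strlen() line only runs if the mbstring extension is available:

```php
<?php
$word = 'árvíztűrő';

echo strlen($word), "\n"; // 13: bytes (5 ASCII letters + 4 two-byte letters)
if (function_exists('mb_strlen')) {
    echo mb_strlen($word, 'UTF-8'), "\n"; // 9: characters
}

// strpos() and substr() work on byte offsets, so strlen() is the matching unit.
$sample = 'x árvíztűrő one two tükörfúrógép'; // made-up sample string
$start  = strpos($sample, $word) + strlen($word);
$end    = strpos($sample, 'tükörfúrógép', $start);
$inner  = substr($sample, $start, $end - $start);

echo trim($inner), "\n"; // one two
```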
