Optimized regex for N words around a given word (UTF-8)

Question

I'm trying to find an optimized regex to return the N words (if available) around another one to build a summary. The string is in UTF-8, so the definition of "words" is larger than just [a-z]. The string that serves as the reference word could be in the middle of a word or not directly surrounded by spaces.

I've already got the following that works but seems actually greedy and chokes when looking for more than 6-7 words around another one:

/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u

This is the PHP method I've build to do that but I'd need help getting the regex to be less greedy and work for any number of words around.

/**
 * Finds N words around a specified word in a string.
 *
 * @param string $string The complete string to look in.
 * @param string $find The string to look for.
 * @param integer $before The number of words to look for before $find.
 * @param integer $after The number of words to look for after $find.
 * @return mixed False if $find was not found and all the words around otherwise.
 */
private function getWordsAround($string, $find, $before, $after)
{
    $matches = array();
    $find = preg_quote($find);
    $regex = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' .
        $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}';
    if (preg_match("/$regex/u", $string, $matches)) {
        return $matches[0];
    } else {
        return false;
    }
}

If I had the following $string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum 
eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu 
fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus 
luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. 
Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."

And called getWordsAround($string, 'vitae', 8, 8) I'd want to get the following result:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit,"

Thank you for your help regex gurus.

For starters, `\s` includes `\r` and `\n`, so adding them to the same character class is superfluous. Also `[^\s]` is equivalent to `\S` — NullUserException, Aug 27 '10 at 16:27
This is an interesting problem by the way. When I get back I'll try and come up with a better solution. +1 — NullUserException, Aug 27 '10 at 17:38
@NullUserException Thanks! I'm having fun working on it too. Tell me if you find a better solution, in the meantime, I'll see if I can come up with something too, I've got some nice ideas from what was said below. — lpfavreau, Aug 28 '10 at 13:49

a'r · Answer 1 · 2010-08-27T16:28:30.013

2

What about using a regex or some other method to split the input text into an array of words. Then run through the words with a loop looking for the target word. Once it's found, then grab the required array slice, join it together and print.

To maintain the original whitespace between words, you can include it at the end of each word.

Also, this could be implemented as a stream parser rather than splitting the whole string first.

edited Aug 27 '10 at 16:28

answered Aug 27 '10 at 16:16

a'r

35,921
7
66
67

1

I like the idea on paper, but when you get to implement you'll run into roadblocks (eg: how should you join the pieces back together while maintaining their original separators)? – NullUserException Aug 27 '10 at 16:26
@NullUserException, you could include the whitespace with the word token or implement a stream parser that saves the last N word boundaries as it goes through the string. – a'r Aug 27 '10 at 16:29
If he's not using regular expressions, he might as well loop through the string until he finds the word he wants and then go backwards and forward to find the surrounding words. It will be faster and certainly more memory efficient. – Artefacto Aug 27 '10 at 16:29
1

He doesn't even have to to backwards. He can save the position of the last word before in a circular array with size `n` (where `n` is the number of words he wants to keep before the match) as he traverses the string. – Artefacto Aug 27 '10 at 16:36

wuputah · Answer 2 · 2010-08-27T17:12:54.493

2

As mentioned earlier, the problem is a very large amount of backtracking. To solve this, I tried to use lookbehind and lookahead to anchor the match to the string. So I came up with:

/consectetur(?<=((?:\S+\s+){0,8})\s*consectetur)\s*(?=((?:\S+\s+){0,8}))/

Unfortunately, this does not work, as variable length lookbehinds are not supported in PCRE (or perl for that matter). So we are left with:

/consectetur\s*(?:\S+\s+){0,8}/

Which only captures the match string and up to 8 words after the match. However, if you use the PREG_OFFSET_CAPTURE flag, get the offset of $match[0], take the substring up to that point, reverse the string with strrev, get the first 0-8 words (using /\s*(?:\S+\s+){0,8}/), reverse the match, and recombine:

$s = "put test string here";
$matches = array();
if (preg_match('/consectetur\s*(?:\S+\s+){0,8}/', $s, $matches, PREG_OFFSET_CAPTURE)) {
  $before = strrev(substr($s, 0, $matches[0][1]));
  $before_match = array();
  preg_match('/\s*(?:\S+\s+){0,8}/', $before, $before_match);
  echo strrev($before_match[0]) . $matches[0][0];
}

You can make it a bit faster on very big strings by taking a safe subset of characters prior to the match, like 100. Then you are only reversing a 100 character string.

All that being said, a solution which doesn't use regular expressions may work better.

edited Aug 27 '10 at 17:12

answered Aug 27 '10 at 16:50

wuputah

11,285
1
43
60

Edited to add actual PHP code. Seems to work great on the test string. – wuputah Aug 27 '10 at 17:12
I think there I've read somewhere there is a problem with PREG_OFFSET_CAPTURE because it returns the byte offset instead of the actual number of characters and strrev is not multibyte compatible. This would work great on a latin-1 string but not UTF-8 I'm afraid. And reversing UTF-8 in PHP is not efficient, at least the functions I've tried. – lpfavreau Aug 27 '10 at 17:36
You actually want the byte offset for `substr`, not the character offset. As far reversing strings in UTF-8, the efficiency of such code could be made quite negligible if you establish a reasonable `substr` length to capture, e.g. `($before * 20)` bytes. Any encoding issues would be at the beginning of the string, which should be cut off when you match `$before` words. – wuputah Aug 27 '10 at 17:47
@lpfavreau There's no reason reversing in PHP shouldn't be efficient. – Artefacto Aug 28 '10 at 02:04
@Artefacto You need to write a C function for [mb_strrev](http://stackoverflow.com/questions/434250/how-to-reverse-a-unicode-string). :) – wuputah Aug 29 '10 at 03:17
@wuputah What I might do is to write an efficient PHP function for that. The accepted answer is inefficient. – Artefacto Aug 29 '10 at 03:23
@Artefacto There is still no accepted answer although there are lots of nice potential solutions. Still looking for an efficient PHP-only solution (although efficient will be pale compared to your C function), ideally in regex to learn from it. :-) – lpfavreau Aug 29 '10 at 16:53
@Ipfavreau Artefacto was referring to the accepted answer on the mb_strrev question. As for this question, I doubt you'll see any other answers posted. – wuputah Aug 29 '10 at 18:17

Artefacto · Answer 3 · 2010-08-28T16:11:50.670

Here is an internal PHP function that does what you want. It's unlikely you'll be able to beat this performance-wise in a user-land function.

There should be no problem using this for UTF-8 functions, since '\r', '\n' and ' ' (and in general all the ASCII characters) cannot appear as part of another character sequence. So if you pass valid UTF-8 data to both parameters you should be fine. Reversing UTF-8 data as you would normally reverse single character encodings (with strrev) would indeed mean trouble, but this function doesn't do that.

PHP_FUNCTION(surrounding_text)
{
    struct circ_array {
        int *offsets;
        int cur;
        int size;
    } circ_array;
    long before;
    long after;
    char *haystack, *needle;
    int haystack_len, needle_len;
    int i, in_word = 0, in_match = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ssll",
        &haystack, &haystack_len, &needle, &needle_len, &before, &after) 
        == FAILURE)
        return;

    if (needle_len == 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Cannot have empty needle");
        return;
    }

    if (before < 0 || after < 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Number of words after and before should be non-negative");
        return;
    }

    /* saves beggining of match and words before */
    circ_array.offsets = safe_emalloc(before + 1, sizeof *circ_array.offsets, 0);
    circ_array.cur = 0;
    circ_array.size = before + 1;

    for (i = 0; i < haystack_len; i++) {
        if (haystack[i] == needle[in_match]) {
            in_match++;
            if (!in_word) {
                in_word = 1;
                circ_array.offsets[circ_array.cur % circ_array.size] = i;
                circ_array.cur++;
            }
            if (in_match == needle_len)
                break; /* found */
        } else {
            int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';

            if (in_match)
                in_match = 0;

            if (is_sep) {
                if (in_word)
                    in_word = 0;
            } else { /* not a separator */
                if (!in_word) {
                    in_word = 1;
                    circ_array.offsets[circ_array.cur % circ_array.size] = i;
                    circ_array.cur++;
                }
            }
        }
    }

    if (in_match != needle_len) {
        efree(circ_array.offsets);
        RETURN_FALSE;
    }


    /* find words after; in_word is 1 */
    for (i++; i < haystack_len; i++) {
        int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';
        if (is_sep) {
            if (in_word) {
                if (after == 0)
                    break;
                after--;
                in_word = 0;
            }
        } else {
            if (!in_word)
                in_word = 1;
        }
    }

    {
        char *result;
        int start, result_len;
        if (circ_array.cur < circ_array.size)
            start = circ_array.offsets[0];
        else
            start = circ_array.offsets[circ_array.cur % circ_array.size];

        result_len = i - start;
        result = emalloc(result_len + 1);
        memcpy(result, &haystack[start], result_len);
        result[result_len] = '\0';

        efree(circ_array.offsets);
        RETURN_STRINGL(result, result_len, 0);
    }

}

From my tests, the C function is 4 times faster than wuputah's version (and doesn't have the problem of strrev).

Wow, this is impressive. +1 for probably finding the fastest way to resolve this problem. I didn't have the time to test it, in fact, I've never compiled my own PHP function and I'm not sure it'll be convenient for its distribution, but nonetheless, it doesn't remove anything to how you resolved that problem. I'm still looking for a PHP-only solution but this should get points anyway! Thanks! — lpfavreau, Aug 28 '10 at 13:42
By the way, when declaring is_sep, you check twice for '\n' so I guess you could remove one check there. — lpfavreau, Aug 28 '10 at 13:43

Artefacto · Answer 4 · 2010-08-27T16:39:19.873

1

This worked fine here:

(?:[^\s\r\n]*[\s\r\n]+){0,8}(?:[^\s\r\n]*)consectetur(?:[^\s\r\n]*)(?:[\s\r\n]+[^\s\r\n]*){0,8}

Gives:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicula suscipit,

The performance of this regular expression, however, is absolute crap. I really don't know how to make this more efficient, short of doing it without regular expressions.

The reason for the performance being "absolute crap" for words near the end is that engine tries to start a match on every character and then advances several dozens of characters until it finds out that, in the end, it cannot find the string you're looking for and discards everything.

edited Aug 27 '10 at 16:39

answered Aug 27 '10 at 15:57

Artefacto

96,375
17
202
225

Bad example from my part, sorry for that. Try it with the word vitae. I don't know why but when it's further away in the string, it seems to become very very slow. – lpfavreau Aug 27 '10 at 16:35
Ah, didn't see the edit. I know I could do it without regex but I'd still want to see if someone has an idea so I can learn from it. +1 for the plain word explanation on why the performance is absolute crap. :-) – lpfavreau Aug 27 '10 at 16:43

NullUserException · Answer 5 · 2010-08-27T16:55:42.103

The problem with using this regex is that it causes the regex engine to catastrophically backtrack. The number of attempts increases exponentially with the size of the string, and that is no good. You might want to look into atomic grouping to improve performance.

Alternatively you could find the first occurrence of the given word and start looking backwards and forwards for words up to the desired length. Pseudo-ish code:

$pos = strpos($find);
$result = $find;

foreach $word before $pos {
    $result = $word . $result;
    $count++
    if ($count >= $target)
        break;
}

foreach $word after $pos {
    $result .= $word;
    $count++
    if ($count >= $target)
        break;
}

Of course finding the words before and after, and handling partial strings can get really messy.

You should use a circular array like I said in the comment to ar's answer. It's inefficient to traverse a UTF-8 string backwards and very efficient to do it forward. — Artefacto, Aug 27 '10 at 16:59

Optimized regex for N words around a given word (UTF-8)

5 Answers5