
I'm trying to find strings that contain a domain. I have the following pattern:

"|s:\\d+:\\\\\"((?:.(?!s:\\d+))+?){$domain}(.+?)\\\\\";|"

This (pattern) seems to work, but I get only the first two matches in PHP.

$filename = "caciki_tr.sql";
$domain   = "caciki.com.tr";

$domain   = escape($domain, ".");

$content = file_get_contents($filename);

$pattern = "|s:\\d+:\\\\\"((?:.(?!s:\\d+))+?){$domain}(.+?)\\\\\";|";

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
print_r($matches);

function escape($string, $chars) {
    $chars = str_split($chars);
    foreach ($chars as $char) {
        $string = str_replace($char, "\\{$char}", $string);
    }
    return $string;
}

Array
(
    [0] => Array
        (
            [0] => s:121:\"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/woocommerce/single-product/product-thumbnails.php\";
            [1] => /home/caciki/domains/
            [2] => /public_html/wp-content/themes/rafine/woocommerce/single-product/product-thumbnails.php
        )

    [1] => Array
        (
            [0] => s:81:\"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/style.css\";
            [1] => /home/caciki/domains/
            [2] => /public_html/wp-content/themes/rafine/style.css
        )

)

I get all 11 matches only when I tinker with the target file. Something must be breaking the pattern/PHP.

I've tested the same pattern in Python and C#, and both give the correct result.

So what's wrong here?

caciki_tr.sql (target file)


Update: The pattern here is used with different substrings (e.g., domain, URL, username). Not all strings in the target file follow the same pattern. For example, a pattern for URLs should be able to match the following:

$url = "http://[DOMAIN_OMITTED]/~caciki";
$pattern = "|s:\d+:\\\\\"([^s]*(?:s(?!:\d)[^s]*)*){$url}(.+?)\\\\\";|";

s:28:\"http://[DOMAIN_OMITTED]/~caciki\";
s:28:\"<a href=\"http://[DOMAIN_OMITTED]/~caciki\">some page</a>\";

In short, there might be nothing between the s:28:\" and the substring ($url), or nothing after the substring, so both surrounding parts should be optional.

akinuri
  • Your file is too big, I suspect. [Here](https://www.phpclasses.org/package/9697-PHP-Search-large-files-that-would-not-fit-in-memory.html) is an interesting solution in case you want to work with large files. BTW, `(?:.(?!s:\d+))+?` is very inefficient, you may use `[^s]*(?:s(?!:\d)[^s]*)*` to streamline it a bit. – Wiktor Stribiżew Sep 17 '18 at 12:51
  • Huh, it worked. I changed `(?:.(?!s:\d+))+?` into `[^s]*(?:s(?!:\d)[^s]*)*`, and now I get 11 matches... So you're saying it might be related to the filesize. I suppose, as the filesize increases, performance decreases, and at some point it fails?... I was using this just for debug purposes, and your suggestion is enough for now, but I'll definitely take a look at that page. – akinuri Sep 17 '18 at 13:11

1 Answer

The current pattern is inefficient because it contains a corrupted "tempered greedy token", (?:.(?!s:\d+))+?, which runs a lookahead after every single character it consumes. Such a construct should be "unwrapped" if you want to use the regex in production.

You may use [^s]*(?:s(?!:\d)[^s]*)* instead of it:

"|s:\d+:\\\\\"([^s]*(?:s(?!:\d)[^s]*)*)$domain(.+?)\\\\\";|"
               ^^^^^^^^^^^^^^^^^^^^^^^

Details

  • [^s]* - 0+ chars other than s
  • (?: - a non-capturing group repeating...
    • s(?!:\d) - s not followed by : and a digit
    • [^s]* - 0+ chars other than s
  • )* - zero or more times.
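
As a minimal sketch of how the unwrapped pattern might be used, the snippet below runs it over two sample records copied from the question and its comments. Two things here are assumptions rather than part of the original code: the domain is escaped with `preg_quote()` (as suggested in the comments), and the pattern is built from single-quoted strings, where `\\\\` ends up as `\\` in the regex, i.e. one literal backslash.

```php
<?php
// Sample data: two escaped serialized strings taken from the thread.
// In single-quoted PHP strings, \" stays as a backslash followed by a quote.
$content = 's:13:\"image_classes\";'
         . 's:81:\"/home/caciki/domains/caciki.com.tr/public_html/wp-content/themes/rafine/style.css\";';

// preg_quote() escapes regex metacharacters; the second argument also
// escapes the "|" delimiter should it occur in the value.
$domain = preg_quote('caciki.com.tr', '|');

// Unwrapped prefix group, then the domain, then the rest of the string.
$pattern = '|s:\d+:\\\\"([^s]*(?:s(?!:\d)[^s]*)*)' . $domain . '(.+?)\\\\";|';

$count = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo 'before: ', $m[1], PHP_EOL;
    echo 'after:  ', $m[2], PHP_EOL;
}
```

Only the second record contains the domain, so only it should match. Note that `(.+?)` still requires at least one character after the domain, so this sketch does not cover strings that end right after the substring.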

Note that if you plan to work with big files, make sure your patterns are as efficient as possible. Also, here is an interesting solution in case you want to work with large files (pcregrep is a very fast tool).
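
The thread never confirms why PHP stopped after two matches. One plausible explanation, stated here as an assumption, is that PCRE hit a resource limit such as `pcre.backtrack_limit`, in which case `preg_match_all()` returns `false` and may leave behind only the matches found so far. The sketch below shows one way to check for that with `preg_last_error()`. The subject string is synthetic, and the limit is lowered on purpose (with JIT switched off) so the failure is easy to reproduce on a small input.

```php
<?php
// Assumptions for this sketch only: disable JIT and shrink the backtrack
// limit so a small synthetic subject is enough to trigger the failure.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '100');

// Synthetic subject: a long run of filler before the domain.
$content = 's:1:\"' . str_repeat('a', 10000) . 'caciki.com.tr/x\";';

$domain = preg_quote('caciki.com.tr', '|');

// The original pattern shape, with a lookahead after every character.
$pattern = '|s:\d+:\\\\"((?:.(?!s:\d+))+?)' . $domain . '(.+?)\\\\";|';

$result = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

if ($result === false) {
    // preg_last_error() returns one of the PREG_*_ERROR constants.
    $code = preg_last_error();
    echo 'PCRE error code: ', $code,
         ($code === PREG_BACKTRACK_LIMIT_ERROR ? ' (backtrack limit)' : ''),
         PHP_EOL;
} else {
    echo "Matched {$result} time(s)", PHP_EOL;
}
```

Checking the return value for `false` before reading `$matches` is what distinguishes "no more matches" from "the engine gave up".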

Wiktor Stribiżew
  • I've added an update to my question regarding the pattern change. It seems to fail in some cases, e.g. this is a match: `s:28:\"http://[DOMAIN_OMITTED]/~caciki\";s:13:\"image_classes\"`. – akinuri Sep 17 '18 at 14:39
  • @akinuri `[^s]*(?:s(?!:\d)[^s]*)*` can match an empty string. I think you have a problem with your `escape` method. Replace it with a mere `$domain = preg_quote($domain, "|");`. Besides, in the question you pass `$domain` to the pattern, in the edit part, it is `$url`. Please make sure you are not mixing up variables. – Wiktor Stribiżew Sep 17 '18 at 14:46
  • See: [regex101](https://regex101.com/r/ZrfnjH/1). When I debug these serialized variables (strings), I need to be able to get the surrounding strings (if there are any). Serialized string pattern is like this: `s:[STRING_LENGTH]:\"[STRING]\"` and `STRING` could be `str + substr (domain, url, etc) + str` or just the substring. You get the idea. In the regex demo, the second match captures more than necessary. – akinuri Sep 17 '18 at 15:06
  • Thank you. You've been a great help. – akinuri Sep 17 '18 at 15:31