Simple regex seems to cause infinite loop in PHP

Question

The following 2 lines are my code:

$rank_content = file_get_contents('https://www.championsofregnum.com/index.php?l=1&ref=gmg&sec=42&world=2');
$tmp_ = preg_replace("/.+width=.16.> /Uis", "", $rank_content, 1);

The second line above causes an infinite loop. In contrast, the following alternatives DO work:

$tmp_ = preg_replace("/.+width=.16.> /Ui", "", $rank_content, 1);
$tmp_ = preg_replace("/[^§]+width=.16.> /Uis", "", $rank_content, 1);

Sadly, they do not give me what I want: neither alternative matches across the line breaks within $rank_content.

Also, if I replace the file_get_contents call with something like

$rank_content = "asdfas\nasdfasdfaswidth=m16m> teststring";

there are no problems either, although \n represents a line break too, doesn’t it?

So am I right that the regex engine has trouble with a string that contains line breaks?

How can I cut $rank_content (which spans multiple lines) down to a substring by removing everything up to the first occurrence of something like width="16"> ? (Such phrases can be seen in the site's source code.)

No, `\n` represents line breaks only in double quoted string. — Marek, Jun 26 '14 at 15:31
I don't see anything on the linked page that matches `width=.16.>`. Was that a mistake? — Mr. Llama, Jun 26 '14 at 15:56
the source code of the page tells me there are a bunch of phrases like "realm." width="16" src="include..."" — phil294, Jun 26 '14 at 16:53
It seems the problem is the LENGTH of the haystack variable $rank_content. Its length is about 90,000, while the maximum allowed length for regex match() is about 30,000. For those interested: http://stackoverflow.com/questions/8268624/php-preg-match-all-limit I myself am going to solve the problem using another method for reading the contents of a website like HTML Unit. — phil294, Jun 26 '14 at 20:02
What you've got here is an [x/y problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You haven't described what you're trying to do and have focussed entirely on the problems of the solution you've chosen to do it. Also, the description is quite misleading - there is no infinite loop, it's very slow, but it'll probably complete if you leave it long enough (it did for me); and it's so slow because of the regex you're using. — AD7six, Jun 26 '14 at 23:51

score 0 · Answer 1 · edited May 23 '17 at 10:26

Replace the m modifier with the s modifier. m changes the behaviour of ^ and $, whereas s changes the behaviour of .
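
To make the difference concrete, here is a minimal sketch using the test string from the question. The variable names $without_s and $with_s are made up for illustration; only the s modifier changes between the two calls.

```php
<?php
// Sample input taken from the question (double quotes, so \n is a real line break).
$text = "asdfas\nasdfasdfaswidth=m16m> teststring";

// Without s, "." does not match "\n", so the match can only start on the second line.
$without_s = preg_replace("/.+width=.16.> /Ui", "", $text, 1);

// With s, "." also matches "\n", so the match starts at the very beginning.
$with_s = preg_replace("/.+width=.16.> /Uis", "", $text, 1);

var_dump($without_s); // the first line survives: "asdfas\nteststring"
var_dump($with_s);    // everything before the marker is gone: "teststring"
```

The m modifier would not change either result here, because the pattern contains neither ^ nor $.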

That said, you should not be parsing HTML with regex. Seriously. Bad things happen.
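
As a rough sketch of one alternative, PHP's bundled DOM extension can locate the elements instead of a regex. The HTML fragment below is invented for illustration (the real page's markup is an assumption, based only on the width="16" attribute mentioned in the comments), and $names is a hypothetical variable.

```php
<?php
// Hypothetical fragment; the actual page structure is an assumption.
$html = '<table>'
      . '<tr><td><img alt="realm" width="16" src="a.png"> Player One</td></tr>'
      . '<tr><td><img alt="realm" width="16" src="b.png"> Player Two</td></tr>'
      . '</table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query('//img[@width="16"]') as $img) {
    // In this fragment the label is the text node right after the image.
    $next = $img->nextSibling;
    if ($next !== null) {
        $names[] = trim($next->textContent);
    }
}

print_r($names);
```

The XPath expression would need adjusting to whatever the real markup looks like.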

answered Jun 26 '14 at 15:33 by Niet the Dark Absol; edited May 23 '17 at 10:26 by Community

yes, sorry, s was what I also used in my code. Edited it, problem remains. – phil294 Jun 26 '14 at 15:36
Everyone always links to that post, but never the one right under it.... http://stackoverflow.com/a/1733489/485418 – Samsquanch Jun 26 '14 at 17:59
why shouldn't I use regular expressions in order to retrieve some values from a website? Are there better ways to do so? – phil294 Jun 26 '14 at 18:05
edit. I'll have a look at HTMLUnit library. Does not solve the problem though. – phil294 Jun 26 '14 at 18:06

score 0 · Accepted Answer · edited May 23 '17 at 10:26

I give up on it. The problem seems to be the length of the haystack variable $rank_content: it is about 90,000 characters long, while the maximum allowed length for a regex match() is about 30,000, so I guess the same applies to a regex replace(). Solving this is surely possible; if anybody is interested, have a look at this question: PHP preg_match_all limit (http://stackoverflow.com/questions/8268624/php-preg-match-all-limit).
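
If a PCRE limit really is what is being hit, PHP can report it: preg_replace() returns null on failure and preg_last_error() gives the reason. The sketch below only shows how to check. The helper name replace_or_report is made up, and it runs on the short test string from the question rather than the real page.

```php
<?php
// Hypothetical helper (not from the linked question): run the replacement
// and report whether the PCRE engine gave up.
function replace_or_report($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject, 1);
    if ($result === null) {
        // null means the engine failed; ask it why.
        $code = preg_last_error();
        $reason = ($code === PREG_BACKTRACK_LIMIT_ERROR)
            ? 'backtrack limit reached'
            : "PCRE error code $code";
        return [null, $reason];
    }
    return [$result, 'ok'];
}

list($out, $status) = replace_or_report(
    "/.+width=.16.> /Uis",
    "",
    "asdfas\nasdfasdfaswidth=m16m> teststring"
);

echo $status, ": ", var_export($out, true), "\n";
// The configured limit, for comparison with the size of the input.
echo "pcre.backtrack_limit = ", ini_get('pcre.backtrack_limit'), "\n";
```

Running the same helper on the downloaded page would show whether the call fails with an error or is merely slow.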

I am going to solve the problem with another method for reading the contents of a website, such as HtmlUnit, or maybe by retrieving the page line by line.
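
For the narrow task of dropping everything up to a fixed marker, plain string functions avoid the regex engine entirely. This is only a sketch: the helper name strip_through_marker and the sample string are made up, and it assumes the marker is the literal text width="16"> followed by a space.

```php
<?php
// Hypothetical helper: remove everything up to and including the first
// occurrence of $marker; return the input unchanged if it is absent.
function strip_through_marker($haystack, $marker)
{
    $pos = strpos($haystack, $marker);
    if ($pos === false) {
        return $haystack;
    }
    return substr($haystack, $pos + strlen($marker));
}

// Invented multi-line sample standing in for the downloaded page.
$sample = "line one\nline two <img width=\"16\"> teststring";
echo strip_through_marker($sample, 'width="16"> ');
```

Using stripos() instead of strpos() would give a case-insensitive search, closer to the i modifier in the original pattern.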
