
I would like to grab value updateXXXX from the following HTML code using cURL:

<input type="hidden" id="_postupdate" name="_postupdate" value="updateXXXX" /><input type="hidden"(...)

I tried

$regex = '/name="_postupdate" value="(.*?)" \/><input type="hidden"/s';
if ( preg_match($regex, $page, $list) )
echo $list[0];

but without success. Any advice? :) Thanks

Kris
  • I would say: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Christian Kuetbach Aug 28 '13 at 20:52
  • `DOMDocument`, `->getElementById()`, `->getAttribute()`, done. – Wrikken Aug 28 '13 at 20:55
  • @ChristianKuetbach: Please don't post links to that question, because they are not helpful to the reader, unless you follow it up with something that is an answer they can use. *You* know the point of the comment and that wall of text is that parsing HTML with regexes is a bad idea. However, to someone else who is asking, that is not at all clear. Worse, it doesn't point the reader to any useful solutions that *can* help parse HTML reliably. – Andy Lester Aug 28 '13 at 21:00
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Aug 28 '13 at 21:01
  • @AndyLester That link is not really helpful. On this point you may be right, BUT: I think there is a huge number of questions about regex and HTML here, and the number increases every single day. I think the only really helpful answer would be deleting most of these questions. – Christian Kuetbach Aug 28 '13 at 21:15
  • @ChristianKuetbach: My work to stem the tide is http://htmlparsing.com, which I'm trying to make into a one-stop shop for dealing with these. I welcome additions, if you'd like to help. Github repo is at https://github.com/petdance/htmlparsing/ – Andy Lester Aug 28 '13 at 23:32
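
Wrikken's suggestion in the comments above (`DOMDocument`, `getElementById()`, `getAttribute()`) could look roughly like the sketch below. It assumes the markup from the question is already in a string; in practice that string would be the page body returned by cURL. The variable names are illustrative only.

```php
<?php
// Sample markup taken from the question.
$html = '<input type="hidden" id="_postupdate" name="_postupdate" value="updateXXXX" />';

// Collect parser warnings instead of printing them; real-world pages are rarely valid HTML.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementById() returns null when no element carries the given id.
$input = $dom->getElementById('_postupdate');
$value = ($input !== null) ? $input->getAttribute('value') : null;

var_dump($value); // string(10) "updateXXXX"
```

Because the lookup goes through the `id` attribute, the result does not depend on attribute order, quoting style, or whatever markup follows the tag.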

2 Answers


Don't cripple yourself parsing HTML with regexps! Instead, let an HTML parser library worry about the structure of the markup for you.

You might want to use the DOMDocument class to do this. Then, you can use XPath queries to extract the data.

You could use something like this:

$html = '<input type="hidden" id="_postupdate" name="_postupdate" value="updateXXXX" />';


$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//input[@name="_postupdate"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->getAttribute('value')));
}
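
If only a single value is expected, a variation on the loop above is to let XPath return the string directly. This is a sketch under the same assumption that the markup is already in `$html`; `DOMXPath::evaluate()` with the XPath `string()` function yields an empty string when nothing matches.

```php
<?php
// Sample markup taken from the question.
$html = '<input type="hidden" id="_postupdate" name="_postupdate" value="updateXXXX" />';

// Keep libxml warnings about imperfect markup out of the output.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string() converts the first matching attribute node to its text value.
$value = $xpath->evaluate('string(//input[@name="_postupdate"]/@value)');

echo $value, PHP_EOL; // updateXXXX
```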
Farhan Ahmad
  • Isn't an HTML parser overkill here? Use a regex for a simple match and an HTML parser for a difficult one. Always do things the easiest way. – Lorenz Meyer Aug 28 '13 at 21:13
  • @LorenzMeyer See Andy's comment above. – Farhan Ahmad Aug 29 '13 at 13:11
  • Why do you think matching part of a tag is "simple"? What if one day the `` comes to you as ``? See http://htmlparsing.com/regexes for more examples of what can make your "simple" task be not so simple. – Andy Lester Aug 29 '13 at 13:15
  • @AndyLester Thanks, I understand now why you insist so much on not using regex for this purpose. It's all about error-resistant and future-proof programming. – Lorenz Meyer Aug 29 '13 at 13:21
  • @LorenzMeyer: Exactly. If you're parsing one single HTML file that you know is never going to change, then sure, go ahead and regex it. But if you're dealing with data from the outside, then future-proofing is exactly the concern. Thanks for the reminder of the word "future-proofing". I need to use it more. – Andy Lester Aug 29 '13 at 13:31

You can either use the ungreedy modifier, like this:

$regex = '/name="_postupdate" value="(.*)" \/><input type="hidden"/Us';

Or you can exclude quotes from the captured value, like this:

$regex = '/name="_postupdate" value="([^"]*)" \/><input type="hidden"/s';

I agree that, in general, using regex to parse HTML is not recommended. In this case, though, the text to match is well defined and simple.

Regexes are faster than an HTML parser, but they will fail on even a minor change in the HTML. Be aware of this weakness when using a regex, and avoid one if there is the slightest chance that the markup might evolve over time.
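
With either pattern, the value lands in capture group 1, so it is index 1 of the match array that holds it; index 0 holds the entire matched text. A minimal sketch using the second pattern follows. The second `<input>` tag is a made-up placeholder, since the markup in the question is truncated at that point.

```php
<?php
// Sample markup from the question. The second <input> is hypothetical:
// the original snippet is cut off right after type="hidden".
$page = '<input type="hidden" id="_postupdate" name="_postupdate" value="updateXXXX" />'
      . '<input type="hidden" name="placeholder" value="" />';

// [^"]* cannot run past the closing quote, so no ungreedy modifier is needed.
$regex = '/name="_postupdate" value="([^"]*)" \/><input type="hidden"/s';

$value = null;
if (preg_match($regex, $page, $matches)) {
    // $matches[0] is the whole matched text; $matches[1] is the first capture group.
    $value = $matches[1];
}

echo $value, PHP_EOL; // updateXXXX
```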

Lorenz Meyer