
I'm just trying my hand at crafting my very first regex. I want to be able to match a pseudo HTML element and extract useful information such as tag name, attributes etc.:

$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

if (preg_match('/<(\w+?)(\s\w+?\s*=\s*".*?")+\s*>/', $string, $matches)) {
    print_r($matches);
}

Except, I'm getting:

Array ( [0] =>  [1] => testtag [2] => gamma="abc" ) 

Anyone know how I can get the other attributes? What am I missing?

Guillermo Phillips
  • Your very first regex should not be for matching HTML/XML, as this is the one thing that regexes are genuinely bad at. Believe me, they suck at it, and you should avoid using them for it right from the start. – Tomalak Jul 06 '09 at 15:59
  • But you have to admit it's a good way to learn their limitations. ;) – Alan Moore Jul 06 '09 at 18:04
  • Probably, yes. ;-) It's easy to develop an "anything goes" attitude with regex, making you think that everything that is represented as text *is* text. XML and HTML are not text, they are structured data, and should be processed with data tools, not text tools. Best time to present the warning is when someone just begins with regex. :) – Tomalak Jul 07 '09 at 08:28
  • Thanks to all the people who tried to answer my question. It's looking like it's not possible to do it the way I wanted. Bah humbug! Why use one line of code when you can use twenty or even a whole library? Down with PHP, long live .NET! – Guillermo Phillips Jul 11 '09 at 14:45

3 Answers


Try this regular expression:

/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^'">\s]*))*)\s*>/
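
Applied to the string from the question, a minimal sketch might look like the following. The variable names are just for illustration; note that in a single-quoted PHP string the single quotes inside the pattern have to be escaped. The second group now holds the whole attribute run as one string rather than only the last attribute.

```php
<?php
$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

// Same pattern as above; \' is only the PHP escape for a literal single quote.
$pattern = '/<(\w+)((?:\s+\w+\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\'">\s]*))*)\s*>/';

if (preg_match($pattern, $string, $matches)) {
    echo $matches[1], "\n";        // the tag name
    echo trim($matches[2]), "\n";  // every attribute, as a single string
}
```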

But you really shouldn’t use regular expressions for a context-free language like HTML. Use a real parser instead.

Gumbo
  • Care to elaborate on what you mean by 'real parser'? – Tim Lytle Jul 06 '09 at 15:56
  • @Tim Lytle: Regexes are not parsers. They are *part of parsers*, at most. A real parser is an XML DOM parser, for example - it can parse languages, whereas regexes can only find patterns. – Tomalak Jul 06 '09 at 16:03
  • @Tomalak Ah, did not understand what he meant. Makes perfect sense now. – Tim Lytle Jul 27 '09 at 17:08

As has been said, don't use RegEx for parsing HTML documents.

Try this PHP parser instead: http://simplehtmldom.sourceforge.net/
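
If pulling in a library is not an option, PHP's bundled DOM extension can do the same job. The sketch below uses `DOMDocument` rather than the library linked above, and it assumes the DOM extension is enabled and that libxml's lenient HTML parser keeps the unknown `<testtag>` element (it normally reports a warning about the tag but still builds the node).

```php
<?php
$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';

// Unknown tags make libxml emit warnings; collect them internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($string);
libxml_clear_errors();

$element = $doc->getElementsByTagName('testtag')->item(0);

$attributes = [];
if ($element !== null) {
    // Each attribute node exposes its name and value directly.
    foreach ($element->attributes as $attr) {
        $attributes[$attr->name] = $attr->value;
    }
    echo $element->tagName, "\n";
}

print_r($attributes);
```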

Peter Boughton

Your second capturing group matches the attributes one at a time, each time overwriting the previous one. If you were using .NET regexes, you could use the Captures array to retrieve the individual captures, but I don't know of any other regex flavor that has that feature. Usually you have to do something like capture all of the attributes in one group, then use another regex on the captured text to break out the individual attributes.
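
In PHP, that two-pass approach could be sketched roughly like this. `parse_pseudo_tag` is a hypothetical helper name, not something from the question, and for brevity it only handles double-quoted attribute values like the ones in the sample string.

```php
<?php
// Hypothetical helper: parse one pseudo-tag in two passes.
function parse_pseudo_tag(string $tag): ?array
{
    // Pass 1: capture the tag name and the whole attribute run in a single group,
    // so no repeated capture gets overwritten.
    if (!preg_match('/^<(\w+)((?:\s+\w+\s*=\s*"[^"]*")*)\s*>$/', $tag, $m)) {
        return null;
    }

    // Pass 2: run a second regex over the captured text to split out each name/value pair.
    preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $m[2], $pairs, PREG_SET_ORDER);

    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }

    return ['tag' => $m[1], 'attributes' => $attributes];
}

$string = '<testtag alpha="value" beta="xyz" gamma="abc"  >';
print_r(parse_pseudo_tag($string));
```

The first pattern does the validation and the second does the extraction, which keeps each regex simple enough to read on its own.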

This is why people tend to either love regexes or hate them (or both). You can do some truly amazing things with them, but you also keep running into simple tasks like this one that are ridiculously hard, if not impossible.

Alan Moore