
Generally I'd match HTML attributes with this regex:

\w+=".*?"

but when the HTML contains PHP code it gets kind of dicey. Please consider the following tag:

<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>>
    <?php echo $img; ?>
</option>

The above regex will also match the attribute selected="selected", which is only emitted by PHP logic. Is there a way to match attributes that are not inside PHP tags while still matching the ones whose value may contain PHP code? If not, could I just remove the PHP code that isn't part of an attribute value?
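To make the problem concrete, here is a minimal sketch that runs the naive regex over the tag above with JavaScript's match(). The variable names (`tag`, `naive`, `naiveMatches`) are just for illustration.

```javascript
// The tag from the question, kept verbatim in a raw template literal.
const tag = String.raw`<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>>`;

// The naive attribute regex; the g flag makes match() return every hit.
const naive = /\w+=".*?"/g;

const naiveMatches = tag.match(naive);
console.log(naiveMatches);
// Two hits: the real value="..." attribute, and selected="selected",
// which only exists inside the PHP string literal.
```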

EDIT: Here's what I have so far:

 \w+="(((.(?!<\?php))*?)|((.((?=<\?php).*?(?=\?>))*)*?))*"

This basically means: match a string that starts with a space, then greedily match word characters followed by an equals sign and a double quote, and then match either of the following while capturing as many characters as possible:

  1. A sequence of characters which does not contain the string <?php
  2. A sequence of characters containing the pattern <\?php.*?\?>, or in other words, greedily match the value part of the attribute with all of its PHP code

All of that continues until a closing double quote is encountered.
Deduplicator
CodeFan
  • What exactly do you want? Why use such a complex `preg_match`? – xkeshav Nov 08 '11 at 08:46
  • *Suggestion*: do not use regexes to manipulate HTML code. Use specific libraries/functions. – m0skit0 Nov 08 '11 at 08:49
  • Required reading: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Gibron Nov 08 '11 at 08:52
  • I was trying to extract only the attributes that are not inside PHP tags. In the case which I posted a correct match would return value="" and not selected="selected", which depends on PHP to be output (it is inside PHP tags). I forgot to mention that I'll be matching the regex using JavaScript's match() method. – CodeFan Nov 08 '11 at 08:52
  • I understand that using regex to parse a whole HTML template is not the best thing to embark on. In my case I'm only dissecting a single HTML tag into its logical components and it would work out great if it wasn't for the php code. I'm pretty sure there's ways to accomplish what I'm trying to do here using regex and the tag which I posted is the only pothole in the road. – CodeFan Nov 08 '11 at 08:59
  • I think the simplest thing to do here would be to strip out all <\?php.*?\?> content that is not preceded by a string that matches the pattern \w+=" and then take it from there. – CodeFan Nov 08 '11 at 09:01
  • @user1033923 The simplest thing is to use the provided DOM utilities : https://developer.mozilla.org/en/Traversing_an_HTML_table_with_Javascript_and_DOM_Interfaces. Yes you **CAN** do this with regex, provided you wrote your own HTML regex parser, but the question is : "Do you really want to do this?". – FailedDev Nov 08 '11 at 09:08
  • Actually I was indeed developing a simple javascript regex HTML/PHP parser which would only take single tags as input and return an object of the kind: {element_type: "tr", element_attributes: [{attribute_name: "id", value: [{value_type: "simple", text: "hpb_row"}, {value_type: "evaluated", varname: "hpb_rows"}], {attribute_name: "class", value: [{value_type: "simple", text: "wanted"}]}], element_children: []}; The attribute values would be defined in the object as either simple text or evaluated from the contents of a variable. – CodeFan Nov 08 '11 at 09:25
  • Where is the regex supposed to run, in JavaScript or in PHP? Because if it's running in client-side JavaScript then the PHP code won't be there (given that obviously it will have already run on the server) but if it's in PHP then there's no reason for this question to be tagged as "javascript". Also, if your regex is supposed to match HTML attributes shouldn't it include something to make sure it only matches _inside_ `<` and `>`? – nnnnnn Nov 08 '11 at 10:50
  • That's a very interesting question as it points me to a possibility I hadn't considered before, that is to run the text processing on the server and only return the javascript function which will create the HTML elements dynamically. My basic idea with this project is to pass HTML/PHP tag strings to a Javascript function with an optional parameter - a reference to its parent. The function will return a JSON type of object which defines the HTML/PHP tag which is passed to the function while monitoring the values of the attributes for PHP logic and substituting them with global JS variables. – CodeFan Nov 09 '11 at 10:38
  • ...at the end to have a tree-like object which is to be used by another function to create the HTML dynamically. – CodeFan Nov 09 '11 at 10:40
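
The "strip the PHP that isn't part of an attribute value" idea from the comments can be sketched in JavaScript roughly as follows. This is only a sketch under one assumption: inside a double-quoted value, `<` appears only as the start of a `<?php` tag. The names `rawTag`, `quotedOrPhp` and `stripped` are hypothetical. The regex matches either a whole quoted value (which is kept) or a bare PHP segment (which is dropped), so no lookbehind is needed.

```javascript
// The tag from the question, kept verbatim in a raw template literal.
const rawTag = String.raw`<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>>`;

// Alternative 1 (captured): a double-quoted value that may embed PHP segments.
// Alternative 2 (not captured): a PHP segment outside any quoted value.
const quotedOrPhp = /("[^"<]*(?:<\?php[\s\S]*?\?>[^<"]*)*")|<\?php[\s\S]*?\?>/g;

// Keep quoted values untouched, replace bare PHP segments with nothing.
const stripped = rawTag.replace(quotedOrPhp, (whole, quoted) =>
  quoted !== undefined ? quoted : ""
);

console.log(stripped);
// The simple attribute regex can then be applied to the stripped tag.
console.log(stripped.match(/\w+=".*?"/g));
```

Because the quoted value is consumed in a single match, PHP inside it is never seen by the second alternative. Attribute values in single quotes or without quotes are not handled by this sketch.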

1 Answer

/<\?php[\s\S]*?\?>|\s+(\w+)="([^"<]*(?:<\?php[\s\S]*?\?>[^<"]*)*)"/

This will match either a PHP code segment or a complete attribute="value" sequence in which the value may contain PHP code. After each match you can find out what you caught by checking the contents of the capturing groups. If it's a pure PHP segment you matched, all but group[0] will be empty; otherwise, group[1] will contain the attribute name and group[2] will contain the value.

The regex assumes < will appear inside an attribute value only as the beginning of a <?php tag. Of course that's not a syntactically valid assumption, but it's probably safe anyway. I can make the regex more precise if you need me to, but it will also be much less readable.
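
Since the question mentions JavaScript, here is a rough sketch of how the regex above might be driven with exec() in a loop. The g flag and the names `optionTag`, `tokenRe` and `attributes` are additions for illustration. Note that in JavaScript an unmatched capturing group comes back as `undefined` rather than an empty string, so that is what the check uses.

```javascript
// The tag from the question, kept verbatim in a raw template literal.
const optionTag = String.raw`<option value="<?php echo $img; ?>"<?php echo ($hpb[$i]['image_filename']==$img?' selected="selected"':''); ?>>`;

// The regex from this answer, with the g flag added so exec() can iterate.
const tokenRe = /<\?php[\s\S]*?\?>|\s+(\w+)="([^"<]*(?:<\?php[\s\S]*?\?>[^<"]*)*)"/g;

const attributes = [];
let m;
while ((m = tokenRe.exec(optionTag)) !== null) {
  // Group 1 is set only when the attribute alternative matched;
  // a bare PHP segment leaves groups 1 and 2 undefined and is skipped.
  if (m[1] !== undefined) {
    attributes.push({ name: m[1], value: m[2] });
  }
}

console.log(attributes);
// Only the value attribute is collected; selected="selected" is consumed
// as part of the standalone PHP segment.
```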

Alan Moore