0

I want to strip HTML from a string selectively. I have used strip_tags to allow divs, but I don't want to keep the divs' style/class attributes. That is, I want:

<div class="something">text</div>
<div style="something">text</div>

to simply become:

<div>text</div>

in both cases.

Can anyone help? Thanks!

Bart Kiers
Alex
  • possible duplicate of [php regexp: remove all attributes from an html tag](http://stackoverflow.com/questions/3026096/php-regexp-remove-all-attributes-from-an-html-tag) – Gordon Nov 14 '10 at 22:28
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Nov 14 '10 at 22:28
  • I guess this is not a duplicate, because there may exist other attributes inside the div, that we want to preserve. – aldemarcalazans Sep 14 '22 at 20:06

4 Answers

2

Replace matches of the following regex with nothing:

(?<=<div.*?)(?<!=\t*?"?\t*?)(class|style)=".*?"
J V
  • What if there is an attribute containing `class=` or `style=` like ` `? – Gumbo Nov 14 '10 at 19:37
  • @J V: That won’t fix it, see for example ` `. – Gumbo Nov 14 '10 at 19:42
  • Ok, it's getting complicated now, but I think I got it... Honestly, if the html is so screwed up regex is the last thing to worry about :) – J V Nov 14 '10 at 19:44
  • Never mind the whitespace, this regex won't work because it requires variable-length lookbehinds, and PHP (like most flavors) doesn't do that. Lookbehinds should never be your first resort anyway; there's almost always an easier way. – Alan Moore Nov 14 '10 at 21:10
  • Ah, in that case I cave to vincent :) – J V Nov 14 '10 at 21:41
1

Here is an example:

$str = preg_replace('`<div (?:style="[^"]*"|class="[^"]*")>([^<]*)</div>`i', '<div>$1</div>', $str);

Basically, this matches a div whose only attribute is a style or a class, and captures its content. The whole match is then replaced so that only <div>content</div> remains.

It's longer than J V's version, but it won't replace something like <div style="blablabla" color="blablabla">content</div>, for instance. May or may not be what you want.
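As a rough check of the behaviour described above, the pattern can be run against the two inputs from the question and the two-attribute case. This is only a sketch: the helper name simplifyDiv is hypothetical, and the attribute alternation is written as a non-capturing group so that $1 refers to the div's text.

```php
<?php
// Hypothetical helper, not part of the original answer.
// (?:...) keeps the attribute out of the capture groups, so $1 is the text.
function simplifyDiv(string $html): string
{
    $pattern = '`<div (?:style="[^"]*"|class="[^"]*")>([^<]*)</div>`i';
    return preg_replace($pattern, '<div>$1</div>', $html);
}

echo simplifyDiv('<div class="something">text</div>'), "\n";
echo simplifyDiv('<div style="something">text</div>'), "\n";
// Two attributes: the pattern does not match, so the input is returned unchanged.
echo simplifyDiv('<div style="blablabla" color="blablabla">content</div>'), "\n";
```

Because the pattern requires `>` immediately after a single attribute, any div with a second attribute, or with nested tags in its content, is left alone.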

Vincent Savard
  • I see a problem using the very example the OP gave :) (Hint, repeaters are greedy) – J V Nov 14 '10 at 19:37
  • Actually, the . class is greedy. [^"] is not, it stops after the first " encountered. No worries, I test my code before I post (usually at least!) – Vincent Savard Nov 14 '10 at 19:39
  • Think about it, it doesn't make sense. I have a class that matches every character but ". What happens when it encounters a "? It stops matching. This has nothing to do with * or any quantifier. As I said, I tested my code with OP's example, it works correctly. – Vincent Savard Nov 14 '10 at 19:43
  • Ah yes I see... Although mine only deletes the style/class attribute itself so any other attributes remain. – J V Nov 14 '10 at 19:45
  • The problem with this code is that, if we have another attribute before class or style attribute (example: title="my page"), it will not work. – aldemarcalazans Sep 14 '22 at 19:31
0

As an alternative to regexp (which always freaks me out), I'd suggest using xml_parse_into_struct.

See its page on php.net and the first example there.
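A parser-based approach could look roughly like the sketch below. Note that it swaps in DOMDocument rather than xml_parse_into_struct, because DOMDocument can write the modified markup back out. The function name stripDivAttributes and the temporary wrapper div are assumptions made for illustration, and encoding handling for non-ASCII text is left out.

```php
<?php
// Hypothetical helper: removes the given attributes from every div in a fragment.
function stripDivAttributes(string $html, array $attrs = ['class', 'style']): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings (e.g. for sloppy markup) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in one root element so it parses as a single tree.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    foreach ($root->getElementsByTagName('div') as $div) {
        foreach ($attrs as $attr) {
            $div->removeAttribute($attr); // no-op when the attribute is absent
        }
    }

    // Serialize only the children, dropping the temporary wrapper.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripDivAttributes('<div class="something">text</div><div style="something">text</div>');
```

Other attributes, such as id or title, are left untouched, since only the names listed in $attrs are removed.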

Teson
0

I found that it's very difficult to build a single regex that removes both the class and style attributes of a tag in a single pass. That's because we don't know where these attributes will appear among the other attributes inside the tag (supposing that we want to preserve the other ones). However, we can achieve this by splitting the task into two simpler search-and-replace operations: one for the class attribute and another for the style attribute.

To capture the first part of a div containing a class attribute, with one or more values enclosed in double quotes, the regex is as follows:

(<div\s+)([^>]*)(class\s*=\s*\"[^\">]*\")(\s|/|>)

The same code modified for single quotes:

(<div\s+)([^>]*)(class\s*=\s*\'[^\'>]*\')(\s|/|>)

Or no quotes:

(<div\s+)([^>]*)(class\s*=\s*[^\"\'=/>\s]+)(\s|/|>)

The matched string must then be replaced by the first, second and fourth capture groups, which in PHP's preg_replace() is written as the replacement string $1$2$4.

To eliminate a style attribute instead of a class one, just replace the substring class with the substring style in the regex. To eliminate these attributes in any tag (not only divs), replace the substring div with the substring [a-z][a-z0-9]* in the regex.

Note: the regex above will not eliminate class or style attributes with syntax errors. Example: class="xxxxx (missing a quote after the value), class='xxxxx'' (excess of quotes after the value), class="xxxx"title="yyyy" (no space between attributes), and so on.

Short explanation:

<div\s+                  # beginning of the div tag, followed by one or more whitespaces
[^>]*                    # any set of attributes before the class (optional)
class\s*=\s*\"[^\">]*\"  # class attribute, with optional whitespaces
\s|/|>                   # one of these characters always follows the end of an attribute
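The two-pass idea described above can be sketched as follows, using the double-quote variant of the regex. The function name removeDivAttribute, the `~` delimiter and the case-insensitive flag are assumptions added for illustration; the single-quote and unquoted variants would need their own passes.

```php
<?php
// Hypothetical helper: one pass removes one double-quoted attribute from div tags.
function removeDivAttribute(string $html, string $attr): string
{
    $name = preg_quote($attr, '~');
    // Groups: 1 = tag start, 2 = preceding attributes, 3 = target attribute, 4 = what follows it.
    $pattern = '~(<div\s+)([^>]*)(' . $name . '\s*=\s*"[^">]*")(\s|/|>)~i';
    // Keep groups 1, 2 and 4; drop the attribute itself (group 3).
    return preg_replace($pattern, '$1$2$4', $html);
}

$html = '<div id="a" class="b" style="c">text</div>';
$html = removeDivAttribute($html, 'class'); // first pass
$html = removeDivAttribute($html, 'style'); // second pass
echo $html, "\n";
```

The whitespace that surrounded a removed attribute stays in the tag, so the result can contain extra spaces before the closing `>`. Each pass also removes at most one occurrence of the attribute per tag.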
aldemarcalazans