
Possible Duplicate:
How to parse HTML with PHP?

I would be most grateful if a regex master among you would be kind enough to help me.

I'd like to write a PHP function that converts HTML elements, as per the following:

I want to convert

<span class="heading1">Any generic text, or other html elements such as <p> tags</p> in here</span>

To

<h1 class="heading1">Any generic text, or other html elements such as <p> tags</p> in here</h1>

...So basically I want to convert the heading spans to proper h1 tags (for better SEO), but there could be other, ordinary span tags that I want to preserve.

Any ideas? Thanks in advance.

Inigo
  • Wow. Right. I did read a lot of regex questions, but they're not much good to me, as regex is like double Dutch to me and obviously mine is a specific problem. I realise now this was a stupid question. I've never even heard of an HTML parser before. I'm reading this blog now: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html Thanks for pointing me in the right direction. – Inigo Sep 13 '11 at 18:26
  • Example with DOM: http://codepad.org/lcA9sbAb – Gordon Sep 13 '11 at 20:07

1 Answer


Well, as the commenters above pointed out, it's probably not a good idea. However, since this case is extremely simple, the regex would be pretty easy if you want to live on the edge:

preg_replace('/<(\/*)span/', '<${1}h1', $htmlFile);

This will replace all span tags with h1 tags, whatever their class. Note that it only matches tags written exactly as <span or </span, so any deviation from that format (uppercase <SPAN>, for instance) will slip through. Hence the warnings against this method. I would only recommend it if you are working with a small number of relatively small HTML files, so you can check them for errors.
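To make the "replaces all span tags" caveat concrete, here is a minimal sketch that runs the one-liner on a made-up input. The sample markup and the $converted variable are assumptions for illustration only; the pattern is the one quoted above.

```php
<?php
// Hypothetical input: one heading span and one ordinary span that should stay a span.
$htmlFile = '<span class="heading1">Title</span> <span class="note">keep me</span>';

// The blanket replacement: every "<span" becomes "<h1" and every "</span" becomes "</h1".
$converted = preg_replace('/<(\/*)span/', '<${1}h1', $htmlFile);

echo $converted, PHP_EOL;
// prints: <h1 class="heading1">Title</h1> <h1 class="note">keep me</h1>
```

The ordinary span is converted too, which is exactly what the question wants to avoid.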

EDIT: Yeah, if you only want to replace ones with class="heading1" I'm not touching it. That would require more mucking about with the regex than it would probably take to just fix all the files manually.

EDIT 2: Okay, I'm a little bored and curious, so I'm going to see if I can come up with a regex that would replace all class="heading1" spans and their corresponding closing tags with h1's:

preg_replace('/<span class="heading1">(.*(.*<span.*>.*<\/span>.*)*.*)<\/span>/', '<h1 class="heading1">${1}</h1>', $htmlFile);

If my calculations are correct, this should ignore any matching sets of span tags inside the heading1 span tags.
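One thing worth checking before relying on this pattern is how the greedy .* behaves when another span follows the heading on the same line. The sketch below uses a made-up input (an assumption, not taken from the question) and the pattern exactly as written above.

```php
<?php
// Hypothetical input: the heading span is followed by an unrelated span on the same line.
$htmlFile = '<span class="heading1">Title</span> and <span class="note">a note</span>';

$converted = preg_replace(
    '/<span class="heading1">(.*(.*<span.*>.*<\/span>.*)*.*)<\/span>/',
    '<h1 class="heading1">${1}</h1>',
    $htmlFile
);

echo $converted, PHP_EOL;
// prints: <h1 class="heading1">Title</span> and <span class="note">a note</h1>
```

Because .* is greedy, the match runs to the last </span> on the line, so the heading swallows the following span and the tags end up unbalanced. Input with one heading span per line does not trigger this.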

You're still probably better off using a DOM parser though.
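For comparison, a DOM-based version might look roughly like the sketch below. It is not the code from the codepad link in the comments; the function name convertHeadingSpans, the wrapper div, and the sample input are all assumptions made for illustration. It uses PHP's bundled DOMDocument and DOMXPath classes.

```php
<?php
// Hypothetical helper: turns <span class="heading1"> elements into <h1> elements.
function convertHeadingSpans(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings about sloppy markup quiet
    // Wrap the fragment in a known container so it can be pulled back out afterwards.
    // Note: non-ASCII text may need an explicit encoding declaration here.
    $dom->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Matches spans whose class attribute is exactly "heading1".
    $spans = iterator_to_array($xpath->query('//span[@class="heading1"]'));

    foreach ($spans as $span) {
        $h1 = $dom->createElement('h1');
        foreach ($span->attributes as $attr) {
            $h1->setAttribute($attr->nodeName, $attr->nodeValue);
        }
        // Move the children across; nested spans travel along untouched.
        while ($span->firstChild !== null) {
            $h1->appendChild($span->firstChild);
        }
        $span->parentNode->replaceChild($h1, $span);
    }

    // Serialize only the contents of the wrapper, not the html/body shell.
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$input = '<span class="heading1">Title with <span class="note">a nested span</span> inside</span>'
       . '<span class="note">untouched</span>';
$result = convertHeadingSpans($input);
echo $result, PHP_EOL;
```

Since the parser tracks which closing tag belongs to which opening tag, the nested-span case needs no special handling. The serializer may insert line breaks between block-level elements, so the output is not guaranteed to match the input's whitespace exactly.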

Chriszuma
  • Thanks, this is good... except that I think that if I were to have another span within the original heading1 span, like this: <span class="heading1">Something <span>something else</span> Something</span> ...then your function would replace the first closing span rather than the one corresponding to the heading1 span, correct? If this is the case, I think I'd better just read up about html parsers :) (Unless you can add another cunning addition to your code to deal with this problem?!) Thanks for your help, I appreciate it. – Inigo Sep 13 '11 at 18:29
  • Now you see why it quickly gets out of hand. Take their advice above and use a DOM parser. – Chriszuma Sep 13 '11 at 18:33
  • Indeed I do. Thanks for your help anyway. – Inigo Sep 13 '11 at 18:37
  • Screw it, let's do this. I just want to see if I can. – Chriszuma Sep 13 '11 at 18:41
  • haha, well done, Chriszuma! There just seems to be one little error in your code though: I had to remove the last backslash, so change the part <\/h1> to </h1>. Maybe you could edit? Apart from that, works like a charm! Many thanks! So the full function would be preg_replace('/<span class="heading1">(.*(.*<span.*>.*<\/span>.*)*.*)<\/span>/', '<h1 class="heading1">${1}</h1>', $htmlFile); – Inigo Sep 13 '11 at 18:53