1

I have the HTML from a page in a variable as just plain text. Now I need to remove some parts of the text. This is a part of the HTML that I need to change:

<div class="post"><a name="6188729"></a>
    <div class="igmline small" style="height: 20px; padding-top: 1px;">
        <span class="postheader_left">
            <a href="#"  style="font-size:9pt;"> RuneRifle </a>
            op 24.08.2012 om 21:41 uur
        </span>
        <span class="postheader_right">
            <a href="http://link">Citaat</a> <a href="http://link">Bewerken</a>
        </span>
        <div style="clear:both;"></div>
    </div>
    <div class="text">Testforum</div>
    <!-- Begin Thank -->
    <!-- Thank End -->
</div>

These replaces work:

pageData = pageData.replace(/href=\".*?\"/g, "href=\"#\"");
pageData = pageData.replace(/target=\".*?\"/g, "");

But this replace does not work at all:

pageData = pageData.replace(
  /<span class=\"postheader_right\">(.*?)<\/span>/g, "");

I need to remove every span with the class postheader_right and everything in it, but it just doesn't work. My knowledge of regex isn't that great, so I'd appreciate it if you would tell me how you came to your answer, with a short explanation of how it works.

Lukas Eder
Deep Frozen

2 Answers

2

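The dot doesn't match newlines, and the span you want to remove stretches over several lines, so `(.*?)` never reaches the closing tag. Use [\s\S] instead of the dot: it matches any whitespace or non-whitespace character (i.e., anything, line breaks included).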

As Mike Samuel says, regular expressions are not really the best tool for this, given how much variation HTML allows (for example, a line break after <a, or attributes appearing in a different order). Still, this is how you can match the case in your example HTML.
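To make the difference concrete, here is a minimal sketch comparing the two patterns. The `pageData` string is a shortened, made-up stand-in for the HTML in the question, and the variable names `withDot` and `cleaned` are only for illustration.

```javascript
// Hypothetical sample input: the span spreads over several lines,
// as it does in the question's HTML.
var pageData =
  '<div class="post">\n' +
  '  <span class="postheader_left">RuneRifle</span>\n' +
  '  <span class="postheader_right">\n' +
  '    <a href="#">Citaat</a> <a href="#">Bewerken</a>\n' +
  '  </span>\n' +
  '  <div class="text">Testforum</div>\n' +
  '</div>';

// `.` stops at line breaks, so this pattern finds nothing to replace.
var withDot = pageData.replace(
  /<span class="postheader_right">(.*?)<\/span>/g, '');

// `[\s\S]` matches any character, line breaks included; `*?` keeps the
// match lazy so it ends at the first closing </span>.
var cleaned = pageData.replace(
  /<span class="postheader_right">[\s\S]*?<\/span>/g, '');

console.log(withDot === pageData);                 // true
console.log(cleaned.indexOf('postheader_right'));  // -1
```

Because the match is lazy, it ends at the first `</span>` it meets. That is fine for the markup shown, but it would cut off too early if a `postheader_right` span ever contained another span.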

Brett Zamir
1

I need to remove every span with the class postheader_right and everything in it, but it just doesn't work.

Don't use regular expressions to find the spans. Using regular expressions to parse HTML: why not?

var allSpans = document.getElementsByTagName('span');
for (var i = allSpans.length; --i >= 0;) {
  var span = allSpans[i];
  if (/\bpostheader_right\b/.test(span.className)) {
    span.parentNode.removeChild(span);
  }
}

should do it.

If you only need to work on newer browsers then getElementsByClassName makes it even easier:

Find all div elements that have a class of 'test'

var tests = Array.prototype.filter.call( document.getElementsByClassName('test'), function(elem){
  return elem.nodeName == 'DIV';
});
Mike Samuel
  • The point is that it's not an element or even on the page. `pageData` is a variable with HTML in it as plain text. – Deep Frozen Aug 25 '12 at 12:26
  • @Rune, can you stick it in a DIV via `innerHTML` and then use the DIV's `getElementsByTagName` or `getElementsByClassName` methods? – Mike Samuel Aug 25 '12 at 12:27
  • @Rune, if you trust the whole content to run in your page, then loading via `innerHTML` is a great way to get a tree that you can manipulate. But if you're trying to muck with HTML to avoid loading parts from untrusted third parties, then putting it into `innerHTML` is risky. `innerHTML` causes images to start loading, so an `<img>` tag with an `onerror` handler will cause JavaScript execution even before the DIV you load it into is attached to the document. – Mike Samuel Aug 25 '12 at 12:39
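
For the case discussed in these comments, where `pageData` is only a string, a rough sketch of the DOM route might look like the following. Note that this swaps in a different technique from the one suggested above: instead of a DIV in the live page, it parses the string in a separate document created with `document.implementation.createHTMLDocument`, which, as far as I know, does not fetch images or run inline handlers while parsing. The helper name `removeByClass` and its `doc` parameter are hypothetical, and the code only runs where a DOM is available (i.e., in a browser).

```javascript
// Hypothetical helper (not from the thread): parse an HTML string in a
// separate document, drop every element with the given class, and return
// the remaining markup as a string. `doc` must be a DOM document.
function removeByClass(html, className, doc) {
  // Assumption: a document created this way has no browsing context, so it
  // should not load images or execute inline event handlers during parsing.
  var scratch = doc.implementation.createHTMLDocument('');
  scratch.body.innerHTML = html;

  var matches = scratch.body.getElementsByClassName(className);
  // The collection is live, so walk it backwards while removing nodes.
  for (var i = matches.length; --i >= 0;) {
    matches[i].parentNode.removeChild(matches[i]);
  }
  return scratch.body.innerHTML;
}

// In a browser:
// pageData = removeByClass(pageData, 'postheader_right', document);
```

If the content comes from an untrusted source, treat this only as a way to edit the markup; whatever is left still needs sanitizing before it is inserted into the live page.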