0

I want to extract all links from an HTML document, except those with a specified class name, using a regex.

For example:

<a href="someSite" class="className">qwe</a> <a href="someSite">qwe</a>

As a result, I want to get only href="someSite" from the link that does not have a class equal to "className".

I've created regex:

(?<=<\s*a.*)href\s*?=\s*?("|').*?("|')

which returns exactly what I want, but for all of the links. I have no idea how to add an exception to my regex so that it does not return links with the specified class name.

Any help will be appreciated :)

user1482528
  • What do you want to use: PHP, ASP.NET or JavaScript? Also take into account: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – RoToRa Jun 26 '12 at 11:42
  • Regex for parsing HTML isn't bad at all. As it is only about links... http://stackoverflow.com/a/1733489/402037 (same link as from RoToRa but another answer) :) – Andreas Jun 26 '12 at 11:47

5 Answers

2

If you are open to using jQuery, you can do that without a regex:

 var list = $("a", document).filter(function () {
     // Keep only the links that do not have the class.
     return !$(this).hasClass("className");
 });
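
For readers without jQuery, the same filter can be sketched in plain JavaScript. This is only a minimal sketch and not part of the answer above: `linksWithoutClass` is a hypothetical helper name, and it assumes each element exposes a `className` string, as DOM anchors do.

```javascript
// Hypothetical helper: keep only the links that do NOT carry `excluded`
// as one of their whitespace-separated class names.
function linksWithoutClass(links, excluded) {
  return Array.from(links).filter(function (a) {
    var classes = String(a.className || "").split(/\s+/);
    return classes.indexOf(excluded) === -1;
  });
}

// In a browser this would be called as:
//   linksWithoutClass(document.getElementsByTagName("a"), "className");
```

Splitting the class value on whitespace means a link with class "classNameX" is kept, while one with class "x className" is dropped.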
Kapil Khandelwal
  • Could you also use var list = $("a", document).not('.className')? – theedam Jun 26 '12 at 11:30
  • 1
    Or without any library (but only with "modern" browser^^): `([].slice.apply(document.links)).filter(function(a) { return a.className !== "className"; });` :D – Andreas Jun 26 '12 at 11:36
  • Thanks. Unfortunately, I have to do it in code-behind (.NET). Using jQuery instead of a regex looks very nice, but in my scenario it has to be done on the server side, as I mentioned. – user1482528 Jun 26 '12 at 11:51
0

Assuming you have the HTML in some variable, you could make use of http://code.google.com/p/phpquery/wiki/Selectors (phpQuery, a jQuery-like library for PHP).

Brian
0

The other answers are sensible. But if for any reason you insist on a regex approach, try this.

I'm assuming you're running your regex via PHP (or .NET), since your pattern includes a positive look-behind assertion, which JavaScript did not support at the time of writing.

I've also separated matching the links from filtering out those with bad classes, since a regex is not ideal for the latter (the class attribute may appear at any point within a link's opening tag).

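$str = "<a href='bad_href' class='badClass'>bad link</a> <a href='good_href'>good link</a>";
// Step 1: capture each link and its href. A back-reference is not valid inside a
// character class, so (?:(?!\2).)* means "anything up to the closing quote".
preg_match_all('/<a\s[^>]*?(href\s*=\s*("|\')(?:(?!\2).)*\2)[^>]*>.*?<\/a>/is', $str, $matches);
// Step 2: drop the matches that carry the unwanted class.
foreach ($matches[0] as $key => $match) {
    if (preg_match('/\sclass\s*=\s*(\'|")(?:(?!\1).)*\bbadClass\b(?:(?!\1).)*\1/i', $match)) {
        unset($matches[1][$key]);
    }
}
$matches = array_values($matches[1]); // array containing "href='good_href'"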
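
The two steps can also be folded into one pattern by putting a negative lookahead right after the tag name, which copes with the class attribute appearing before or after href. The following is a hedged JavaScript sketch rather than a robust parser: `hrefsWithoutClass` is a hypothetical name, the excluded class is hard-coded as "className", and it assumes lower-case tag and attribute names, quoted attribute values, and no ">" inside attribute values.

```javascript
// Hypothetical helper: return the href values of all <a> tags whose
// class attribute does not contain the word "className".
function hrefsWithoutClass(html) {
  // (?! ... ) rejects the tag if any class attribute inside it holds "className".
  // Group 1 is the class quote (inside the lookahead), group 2 the href quote,
  // group 3 the href value.
  var re = /<a\b(?![^>]*\sclass\s*=\s*(["'])(?:(?!\1).)*\bclassName\b)[^>]*?\shref\s*=\s*(["'])(.*?)\2/g;
  var hrefs = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    hrefs.push(m[3]);
  }
  return hrefs;
}
```

Note that `\b` treats a hyphen as a word boundary, so a class such as "className-wide" would also be excluded by this sketch.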
Mitya
0
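var yourClassName = "className"; // the class to exclude
var aList = document.getElementsByTagName('a');
for (var i = 0; i < aList.length; i++) {
   // Skip links that carry the excluded class (classList matches whole class names).
   if (aList[i].classList.contains(yourClassName)) continue;
   //...
   //... Your code
}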
ZigZag
0

Disclaimer:

As others will or have already pointed out, using regex to parse non-regular languages is fraught with peril! It is best to use a dedicated parser specifically designed for the job, especially when parsing the tag soup that is HTML.

That said...

If you insist on using a regular expression, here is a tested PHP script implementing a regex solution that does a "pretty good" job:

<?php // test.php Rev:20120626_2100

function strip_html_anchor_tags_not_having_class($text) {
    $re_html_anchor_not_having_class ='% # Rev:20120626_1300
    # Match an HTML 4.01 A element NOT having a specific class.
    <a\b                   # Anchor element start tag open delimiter
    (?:                    # Zero or more attributes before CLASS.
      \s+                  # Attributes are separated by whitespace.
      (?!class\b)          # Only non-CLASS attributes here.
      [A-Za-z][\w\-:.]*    # Attribute name is required.
      (?:                  # Attribute value is optional.
        \s*=\s*            # Name and value separated by =
        (?:                # Group for value alternatives.
          "[^"]*"          # Either a double-quoted string,
        | \'[^\']*\'       # or a single-quoted string,
        | [\w\-:.]+        # or a non-quoted string.
        )                  # End group of value alternatives.
      )?                   # Attribute value is optional.
    )*                     # Zero or more attributes before CLASS.
    (?:                    # Optional CLASS (but only if NOT MyClass).
      \s+                  # CLASS attribute is separated by whitespace.
      class                # (case insensitive) CLASS attribute name.
      \s*=\s*              # Name and value separated by =
      (?:                  # Group allowable CLASS value alternatives.
        (?-i)              # Use case-sensitive match for CLASS value.
        "                  # Either a double-quoted value...
        (?:                # Single-char-step through CLASS value.
          (?!              # Assert each position is NOT MyClass.
            (?<=["\s])     # Preceded by opening quote or space.
            MyClass        # (case sensitive) CLASS value to NOT be matched.
            (?=["\s])      # Followed by closing quote or space.
          )                # End assert each position is NOT MyClass.
          [^"]             # Safe to match next CLASS value char.
        )*                 # Single-char-step through CLASS value.
        "                  # Ok. DQ value does not contain MyClass.
      | \'                 # Or a single-quoted value...
        (?:                # Single-char-step through CLASS value.
          (?!              # Assert each position is NOT MyClass.
            (?<=[\'\s])    # Preceded by opening quote or space.
            MyClass        # (case sensitive) CLASS value to NOT be matched.
            (?=[\'\s])     # Followed by closing quote or space.
          )                # End assert each position is NOT MyClass.
          [^\']            # Safe to match next CLASS value char.
        )*                 # Single-char-step through CLASS value.
        \'                 # Ok. SQ value does not contain MyClass.
      |                    # Or a non-quoted, non-MyClass value...
        (?!                # Assert this value is NOT MyClass.
          MyClass          # (case sensitive) CLASS value to NOT be matched.
        )                  # Ok. NQ value is not MyClass.
        [\w\-:.]+          # Safe to match non-quoted CLASS value.
      )                    # End group of allowable CLASS values.
      (?:                  # Zero or more attribs allowed after CLASS.
        \s+                # Attributes are separated by whitespace.
        [A-Za-z][\w\-:.]*  # Attribute name is required.
        (?:                # Attribute value is optional.
          \s*=\s*          # Name and value separated by =
          (?:              # Group for value alternatives.
            "[^"]*"        # Either a double-quoted string,
          | \'[^\']*\'     # or a single-quoted string,
          | [\w\-:.]+      # or a non-quoted string.
          )                # End group of value alternatives.
        )?                 # Attribute value is optional.
      )*                   # Zero or more attributes after CLASS.
    )?                     # Optional CLASS (but only if NOT MyClass).
    \s*                    # Optional whitespace before closing >
    >                      # Anchor element start tag close delimiter
    (                      # $1: Anchor element contents.
      [^<]*                # {normal*} Zero or more non-<
      (?:                  # Begin {(special normal*)*} construct
        <                  # {special} Allow a < but only if
        (?!/?a\b)          # not the start of the </a> close tag.
        [^<]*              # more {normal*} Zero or more non-<
      )*                   # Finish {(special normal*)*} construct
    )                      # End $1: Anchor element contents.
    </a\s*>                # A element close tag.
    %ix';
    // Remove all matching start and end tags but keep the element contents.
    return preg_replace($re_html_anchor_not_having_class, '$1', $text);
}
$input = file_get_contents('testdata.html');
$output = strip_html_anchor_tags_not_having_class($input);
file_put_contents('testdata_out.html', $output);
?>

function strip_html_anchor_tags_not_having_class($text)

This function strips the start and matching end tags for all HTML 4.01 Anchor elements (i.e. <A> tags) which do NOT have the specific, (case-sensitive) CLASS attribute value containing: MyClass. The CLASS value may contain any number of values, but one of them must be exactly: MyClass. The Anchor tag names and the CLASS attribute name are matched case insensitively.
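
The "one of them must be exactly MyClass" rule is the part that the look-behind and look-ahead assertions in the regex enforce. As an illustration only, here is the same whole-token test sketched in JavaScript; `classValueHasToken` is a hypothetical helper and is not part of the PHP script above.

```javascript
// Hypothetical helper: true when `token` appears in `classValue` as a whole,
// case-sensitive, whitespace-delimited class name.
function classValueHasToken(classValue, token) {
  // Escape regex metacharacters so the token is matched literally.
  var escaped = token.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // The token must be preceded by the start of the value or whitespace,
  // and followed by whitespace or the end of the value.
  return new RegExp("(?:^|\\s)" + escaped + "(?=\\s|$)").test(classValue);
}
```

Under this test "Pre MyClass Post" contains the token, while "NotMyClass" and "myclass" do not, which matches the test data below.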

Example input (testdata.html):

<h2>Paragraph contains links to be preserved (CLASS has "MyClass"):</h2>
<p>
Single DQ matching CLASS: <a href="URL" class="MyClass">Test 01</a>.
Single SQ matching CLASS: <a href="URL" class='MyClass'>Test 02</a>.
Single NQ matching CLASS: <a href="URL" class=MyClass>Test 03</a>.
Variable whitespace: <a href = "URL" class = MyClass >Test 04</a>.
Variable capitalization: <A HREF = "URL" CLASS = "MyClass" >Test 04</A>.
Reversed attribute order: <a class="MyClass" href="URL">Test 05</a>
Class before MyClass: <a href="URL" class="Pre MyClass">Test 06</a>.
Class after MyClass: <a href="URL" class="MyClass Post">Test 07</a>.
Sandwiched MyClass: <a href="URL" class="Pre MyClass Post">Test 08</a>.
Link with HTML content: <a class="MyClass" href="URL"><b>Test</b> 09</a>.
</p>

<h2>Paragraph contains links to be stripped (NO CLASS with "MyClass"):</h2>
<p>
Case does not match: <a href="URL" class="myclass">TEST 10</a>.
CLASS not whole word: <a href="URL" class="NotMyClass">TEST 11</a>.
No class attribute: <a href="URL">TEST 12</a>.
Link with HTML content: <a class="NotMyClass" href="URL"><b>Test</b> 13</a>.
</p>

Example output (testdata_out.html):

<h2>Paragraph contains links to be preserved (CLASS has "MyClass"):</h2>
<p>
Single DQ matching CLASS: <a href="URL" class="MyClass">Test 01</a>.
Single SQ matching CLASS: <a href="URL" class='MyClass'>Test 02</a>.
Single NQ matching CLASS: <a href="URL" class=MyClass>Test 03</a>.
Variable whitespace: <a href = "URL" class = MyClass >Test 04</a>.
Variable capitalization: <A HREF = "URL" CLASS = "MyClass" >Test 04</A>.
Reversed attribute order: <a class="MyClass" href="URL">Test 05</a>
Class before MyClass: <a href="URL" class="Pre MyClass">Test 06</a>.
Class after MyClass: <a href="URL" class="MyClass Post">Test 07</a>.
Sandwiched MyClass: <a href="URL" class="Pre MyClass Post">Test 08</a>.
Link with HTML content: <a class="MyClass" href="URL"><b>Test</b> 09</a>.
</p>

<h2>Paragraph contains links to be stripped (NO CLASS with "MyClass"):</h2>
<p>
Case does not match: TEST 10.
CLASS not whole word: TEST 11.
No class attribute: TEST 12.
Link with HTML content: <b>Test</b> 13.
</p>

The reader wishing to advance their regex-fu would do well to study this (rather long and complex) regex. It is carefully handcrafted for both accuracy and speed and implements several advanced efficiency techniques. It is, of course, fully commented to allow readability by mere humans. This example clearly demonstrates that "REGULAR EXPRESSIONS" have evolved into a rich, (non-REGULAR) programming language.
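
One of those efficiency techniques is the "normal* (special normal*)*" construct used for the anchor contents, often called unrolling the loop. A small JavaScript sketch of just that fragment may help in studying it; `contentsUpToAnchorTag` is a hypothetical helper name, and the pattern is copied from the contents part of the regex above with a leading `^` added.

```javascript
// normal  = [^<]             (any character that is not "<")
// special = <(?!\/?a\b)      (a "<" that does not begin an <a ...> or </a> tag)
var anchorContents = /^[^<]*(?:<(?!\/?a\b)[^<]*)*/i;

// Hypothetical helper: return the leading part of `text` up to, but not
// including, the next anchor start or end tag.
function contentsUpToAnchorTag(text) {
  // The pattern can match the empty string, so exec never returns null here.
  return anchorContents.exec(text)[0];
}
```

Because "normal" and "special" can never match at the same position, the engine has only one way to consume each character, which avoids the backtracking that a lazy `.*?` would incur.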

Note that there will always be edge cases where this solution fails: for example, evil strings within CDATA sections, comments, scripts, styles and tag attribute values can trip it up. (See the disclaimer above.) That said, this solution will do a pretty good job in many cases (but will never be 100% reliable!).

ridgerunner