How to get domains out of HTML with regexp

Question

I have a webpage with a list of 10,000+ domains, like:

<a href="2015-mail.com.html">2015-mail.com</a>
<a href="mail.ru.html">mail.ru</a>
<a href="tut.by.html">tut.by</a>

So I need a regular expression that extracts the text of each hyperlink, but only if that text is domain-like.

Welcome to Stack Overflow! It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. — Avinash Raj, Mar 18 '15 at 11:10
[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) — Biffen, Mar 18 '15 at 11:11

score 0 · Answer 1 · answered Mar 18 '15 at 11:26

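If you are parsing HTML, a regex is not the tool you should be looking for. If you only have this as plain text, you can use the regex href="(([^".]+\.)+[^".]*)". The first capturing group will hold the dotted href value (for example 2015-mail.com.html), and excluding the quote character keeps a match from running past the end of the attribute.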

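// The g flag makes exec() advance through the string; without it the loop never ends.
var re = /<a\b[^>]*\bhref="(([^".]+\.)+[^".]*)"/g;
var str = '<a href="2015-mail.com.html">2015-mail.com</a>';
var m;

while ((m = re.exec(str)) !== null) {
    // Guard against zero-length matches, which would stall the loop.
    if (m.index === re.lastIndex) {
        re.lastIndex++;
    }
    // m[1] is the href value, here "2015-mail.com.html"
    console.log(m[1]);
}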
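
Since the question asks for the link text rather than the href, here is a minimal sketch of that variant. It assumes the markup is available as a plain string and that each link contains only text (no nested tags). The helper name extractDomains and the "domain-like" pattern (dot-separated labels of letters, digits and hyphens, ending in an alphabetic label) are illustrative assumptions, not something taken from the question.

```javascript
// Hypothetical helper: collect the visible text of each <a> element
// when that text looks like a domain name.
function extractDomains(html) {
    // Group 1: link text that contains no nested tags.
    var linkRe = /<a\b[^>]*>([^<]+)<\/a>/gi;
    // Assumed "domain-like" shape: labels of letters, digits and hyphens
    // separated by dots, ending in an alphabetic label of 2+ characters.
    var domainRe = /^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$/i;
    var domains = [];
    var m;
    while ((m = linkRe.exec(html)) !== null) {
        var text = m[1].trim();
        if (domainRe.test(text)) {
            domains.push(text);
        }
    }
    return domains;
}

var sample =
    '<a href="2015-mail.com.html">2015-mail.com</a>\n' +
    '<a href="mail.ru.html">mail.ru</a>\n' +
    '<a href="tut.by.html">tut.by</a>\n' +
    '<a href="about.html">About us</a>';

// Logs the three domains: 2015-mail.com, mail.ru, tut.by
console.log(extractDomains(sample));
```

The "About us" link is skipped because its text does not match the assumed pattern. If the list can contain internationalized or otherwise unusual names, the pattern would need to be loosened.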

