I would like to use a regex to filter URLs, and there are some hosts that I want to exclude. I want to keep every URL that belongs to test.com (fictitious example) except: www.test.com/..., www.abc.test.com/..., www.def.test.com/...

The website that I am interested in has many other subdomains, such as www.ghi.test.com, www.jkl.test.com, www.a.test.com, ...

I tried to use a negative lookahead, but I am having a hard time finding an expression that I'm happy with.

I would also like to know whether I can supply a list of subdomains, for example ['www','abc','def'], and generate the regex from it. That would make it much easier to add exceptions.

Thanks!

th0mash

2 Answers

Sure thing: (?!www\.(?:abc|def)\.test\.com)(?=www.*\.test\..*com)^.+$

This uses a negative lookahead to assert that the line does NOT start with any of the subdomains you don't want, and a positive lookahead to ensure we're matching www.test.com (in some form).

  • (?!www\.(?:abc|def)\.test\.com) asserts that the URL does not begin with www.abc.test.com or www.def.test.com. The dots are escaped so that they only match a literal dot. You can add as many alternatives to this list as you'd like.
  • (?=www.*\.test\..*com) asserts that .test. appears somewhere between www and com.
  • ^.+$ captures any non-empty line that passes both lookaheads.
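
The asker's idea of generating the expression from a list can be sketched in JavaScript as follows. This is only a minimal sketch, not part of the answer itself: buildSubdomainFilter and escapeRegex are hypothetical names, and it assumes each excluded entry is written as the full prefix in front of the domain (so 'www' stands for www.test.com and 'www.abc' for www.abc.test.com). It also assumes the check should be anchored to the hostname part of the URL, with an optional http(s) scheme.

```javascript
// Hypothetical helper: escape characters that are special inside a regex.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a regex that accepts URLs whose host is a
// subdomain of `domain`, unless the host is exactly `<prefix>.<domain>`
// for one of the excluded prefixes.
function buildSubdomainFilter(domain, excludedPrefixes) {
  const host = escapeRegex(domain);
  const excluded = excludedPrefixes.map(escapeRegex).join('|');
  // The host must be followed by a path, port, query, fragment, or the end.
  const end = '(?:[/:?#]|$)';
  return new RegExp(
    '^(?:https?://)?' +                      // optional scheme
    `(?!(?:${excluded})\\.${host}${end})` +  // reject excluded hosts
    `(?:[a-z0-9-]+\\.)+${host}${end}`,       // require at least one subdomain label
    'i'
  );
}

const keep = buildSubdomainFilter('test.com', ['www', 'www.abc', 'www.def']);
const urls = [
  'www.test.com/home',
  'www.abc.test.com/x',
  'http://www.ghi.test.com/page',
  'www.jkl.test.com',
  'www.example.com/test.com',
];
console.log(urls.filter((url) => keep.test(url)));
// → [ 'http://www.ghi.test.com/page', 'www.jkl.test.com' ]
```

Adding an exception then only means appending another prefix to the array; the prefixes are escaped, so a dot in 'www.abc' matches a literal dot only.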

Nick Reed

Do you mean that you effectively want to categorize all *.test.com subdomains as belonging to test.com?

If so, you can use a variation of "Get Domain Extension From Hostname":

function getDomain(domain) {
  const domainExpression = /\w+((\.[a-z]{2,3})(\.(ad|ae|af|ag|ai|al|am|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw))?)$/i;
  const match = domainExpression.exec(domain);
  
  return match ? match[0] : domain;
}

function test(input, expectedOutput) {
  const output = getDomain(input);
  console.log(`${output === expectedOutput ? 'PASS' : 'FAIL'}: ${input} (expected: ${expectedOutput}, output: ${output})`);
}

test('www.test.com', 'test.com');
test('www.abc.test.com', 'test.com');
test('www.jjj.sss.test.com', 'test.com');
test('www.test.com.au', 'test.com.au');
test('www.sub.test.com.au', 'test.com.au');

Soc
  • Quite similar to that: in my project I want to crawl something.chinadaily.com.cn, but neither chinadaily.com.cn nor someotherthings.chinadaily.com.cn, so I was trying to write some if conditions. – th0mash Oct 05 '19 at 10:25
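
The "if conditions" mentioned in the comment can also be written without a regex. The following is a minimal sketch of that alternative, assuming the URLs are absolute (they include a scheme) and that the standard URL class is available, as it is in current browsers and Node.js. The function name shouldCrawl and its parameters are hypothetical.

```javascript
// Hypothetical helper: keep a URL when its host belongs to `domain`
// and is not listed in the set of excluded hosts.
function shouldCrawl(url, domain, excludedHosts) {
  let hostname;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  const inDomain = hostname === domain || hostname.endsWith('.' + domain);
  return inDomain && !excludedHosts.has(hostname);
}

const excluded = new Set([
  'chinadaily.com.cn',
  'someotherthings.chinadaily.com.cn',
]);

console.log(shouldCrawl('https://something.chinadaily.com.cn/a', 'chinadaily.com.cn', excluded)); // → true
console.log(shouldCrawl('https://chinadaily.com.cn/', 'chinadaily.com.cn', excluded));            // → false
```

Comparing whole hostnames against a Set avoids partial matches, and adding an exception is a single new entry in the set.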