Clean and extract Subdomains & Domains from URLs using Regex Notepad++

Question

This is a simple text file.

The URL:

Can have https:// or http://
Eliminate both, as well as any trailing URL/file paths
Extract only domains and/or subdomains

I have Notepad++ and EditPlus.

I'm open to other suggestions.

Examples:

https://appspace.com

http://appspace.com/

http://ayurfit.ning.com/main/authorization/signIn

http://bangalore.olx.in/login.php

http://birthdayshoes.com/forum/index.php

http://birthdayshoes.com/forum/register/

http://forums.virtualbox.org/ucp.php

Tries:

/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/ 
^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$

https://regex101.com/r/hZ4cL4/4

I tried many others on another machine, based on examples from Regex101.

Found this little nugget as well. I'll post how it's different once I understand it.

Regular Expression - Extract subdomain & domain

Will do so. It's on another machine. Hold up. Copy-pasting my tries. – Alex S, Aug 20 '15 at 13:24
/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/ - I think there should be a simpler way? – Alex S, Aug 20 '15 at 13:34
@stribizhev - Just did - It picks up the /index.php as well. – Alex S, Aug 20 '15 at 13:40
@stribizhev - What you posted is good. I just need it to match the sub/domain after http/s:// and avoid the /.php etc. – Alex S, Aug 20 '15 at 13:44
Like [`(?<=//)[\w-]+(?:\.[\w-]+)+\b`](https://regex101.com/r/gT6lK1/1)? – Wiktor Stribiżew, Aug 20 '15 at 13:47
It works almost perfectly, but it hits this and splits it into two: `http://www.911cd.net/forums//index.php` – Alex S, Aug 20 '15 at 13:55
That's a fault in the data - Your answer is right. Please post it as an answer and I will give it credit @stribizhev – Alex S, Aug 20 '15 at 13:59
@stribizhev - Thank you. Just one thing. With the other one `^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$` I was able to do Find/Replace using \1 \2. Here, I can't pull with \0, \1 or \2. What am I missing? Please add to your answer so I can test and finally select yours. – Alex S, Aug 20 '15 at 14:11

score 1 · Answer 1 · answered Aug 20 '15 at 13:28

You could simply extract anything that is between two . Additionally, you could use a lookbehind for http(s) and a lookahead for the file path to fine-tune your results.
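
As a rough sketch of the lookbehind/lookahead idea, here is one way it might look in Python. The name `host_between` is made up for illustration. Python's `re` module only accepts fixed-width lookbehinds, so `http://` and `https://` each get their own lookbehind here; that is an adaptation, not something stated in the answer.

```python
import re

# Assumption: one lookbehind per scheme, since Python's re module
# rejects variable-width lookbehinds such as (?<=https?://).
# The lookahead stops the match at the first "/", whitespace, or end of text.
HOST_BETWEEN = re.compile(r"(?:(?<=http://)|(?<=https://))[^/\s]+(?=[/\s]|$)")


def host_between(url):
    """Return the text between the scheme and the path, or None (hypothetical helper)."""
    match = HOST_BETWEEN.search(url)
    return match.group(0) if match else None


for url in ("https://appspace.com", "http://bangalore.olx.in/login.php"):
    print(host_between(url))
```

This sketch would also keep a port or query string if one directly followed the host, so it probably only suits input shaped like the examples in the question.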

answered Aug 20 '15 at 13:28

Binoy Dalal


score 1 · Accepted Answer · answered Aug 20 '15 at 14:02

For links that start with a protocol, you can use the following regex:

(?<=://)[\w-]+(?:\.[\w-]+)+\b

See demo

The (?<=://) look-behind makes sure there is :// immediately before the value we want to match. The match itself is a sequence of one or more word characters or hyphens ([\w-]+) followed by one or more groups of a period and another such sequence ((?:\.[\w-]+)+), and the final \b requires the match to end at a word boundary.
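
For anyone who would rather script this than use an editor, the same pattern should work unchanged in Python, since `(?<=://)` is a fixed-width lookbehind. The sketch below runs it over the sample URLs from the question; `extract_hosts` is a made-up helper name.

```python
import re

# Pattern from the answer: the host name that directly follows "://".
HOST_RE = re.compile(r"(?<=://)[\w-]+(?:\.[\w-]+)+\b")

urls = [
    "https://appspace.com",
    "http://appspace.com/",
    "http://ayurfit.ning.com/main/authorization/signIn",
    "http://bangalore.olx.in/login.php",
    "http://birthdayshoes.com/forum/index.php",
    "http://birthdayshoes.com/forum/register/",
    "http://forums.virtualbox.org/ucp.php",
]


def extract_hosts(lines):
    """Collect the first host found on each line (hypothetical helper)."""
    hosts = []
    for line in lines:
        match = HOST_RE.search(line)
        if match:
            hosts.append(match.group(0))
    return hosts


hosts = extract_hosts(urls)
for host in hosts:
    print(host)

# With ":" inside the lookbehind, the doubled slash from the comments
# no longer produces a second match.
doubled = HOST_RE.findall("http://www.911cd.net/forums//index.php")
```

Including the colon in the lookbehind appears to be what separates this version from the `(?<=//)` one tried in the comments, which also matched `index.php` after the doubled slash.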

answered Aug 20 '15 at 14:02

Wiktor Stribiżew


Thank you. Just one thing. With the other one `^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$` I was able to do Find/Replace using \1 \2. Here, I can't pull with \0, \1 or \2. What am I missing? – Alex S Aug 20 '15 at 14:08
Check this one: [`^(?:https?://)?([^/.\n]+(?=\.))?(\.?[^/.\n]+\.[^/\n]+)/?(.*)$`](https://regex101.com/r/kB2bM9/2). Replace with `\1\2` or `$1$2`. – Wiktor Stribiżew Aug 20 '15 at 14:12
Is there anything else you need to match in that document of yours? :) – Wiktor Stribiżew Aug 21 '15 at 06:14
I doubt it. Just testing it out today on several data logs. I think it should be a success. I will mark your answer as accepted once that's finished. I doubt there will be any more bugs. Thanks again. You will see a confirmation from my end as soon as it's over. – Alex S Aug 21 '15 at 06:24
Found this little nugget as well http://stackoverflow.com/questions/25703360/regular-expression-extract-subdomain-domain?rq=1 – Alex S Aug 21 '15 at 06:50
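
The find-and-replace pattern from the comments above can also be tried outside the editor. As far as I can tell, it works with `\1\2` because it consumes the whole line and captures the host in groups 1 and 2, whereas the lookbehind pattern has no capturing groups at all. The Python sketch below is only an illustration; `clean_hosts` is a made-up name, and the replacement is written as a function so that an unmatched group 1 becomes an empty string.

```python
import re

# Pattern from the comment above; MULTILINE lets ^ and $ anchor on every line.
LINE_RE = re.compile(
    r"^(?:https?://)?([^/.\n]+(?=\.))?(\.?[^/.\n]+\.[^/\n]+)/?(.*)$",
    re.MULTILINE,
)


def clean_hosts(text):
    """Replace each URL line with groups 1 and 2 (hypothetical helper)."""
    # Same effect as replacing with \1\2; group 1 is optional, so a
    # missing group is treated as an empty string.
    return LINE_RE.sub(lambda m: (m.group(1) or "") + m.group(2), text)


sample = "\n".join([
    "https://appspace.com",
    "http://ayurfit.ning.com/main/authorization/signIn",
    "http://bangalore.olx.in/login.php",
    "http://birthdayshoes.com/forum/register/",
    "http://forums.virtualbox.org/ucp.php",
])

cleaned = clean_hosts(sample)
print(cleaned)
```

Group 3 holds the path and is simply dropped, which is why the file paths disappear from each line.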
