This is a simple text file.

The URLs:

  • Can start with https:// or http://
  • Need both prefixes removed, as well as any trailing URL/file path
  • Should yield only the domain and/or subdomain

I have Notepad++ and EditPlus, and I'm open to other suggestions.

Examples:

https://appspace.com

http://appspace.com/

http://ayurfit.ning.com/main/authorization/signIn

http://bangalore.olx.in/login.php

http://birthdayshoes.com/forum/index.php

http://birthdayshoes.com/forum/register/

http://forums.virtualbox.org/ucp.php

Tries:

/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/ 
^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$

https://regex101.com/r/hZ4cL4/4

I tried many others on another machine, taken from examples on Regex101.

I found this little nugget as well. I'll post how it's different once I understand it.

Regular Expression - Extract subdomain & domain

Alex S

2 Answers


You could simply extract anything that is between two . characters. Additionally, you could use a lookbehind for http(s) and a lookahead for the file path to fine-tune your results.
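As a rough illustration of the lookbehind/lookahead idea, here is a minimal Python sketch. The pattern and the name `HOST` are assumptions of this example, not something the answer spelled out. Python's `re` module only accepts fixed-width lookbehinds, so the two protocol variants are written as two separate lookbehinds.

```python
import re

# Hypothetical pattern: start right after "http://" or "https://"
# (two fixed-width lookbehinds), take everything that is not a slash
# or whitespace, and stop before the first "/" or at end of line.
HOST = re.compile(
    r"(?:(?<=http://)|(?<=https://))[^/\s]+(?=/|$)",
    re.MULTILINE,
)

# Sample lines taken from the question.
urls = """https://appspace.com
http://appspace.com/
http://ayurfit.ning.com/main/authorization/signIn
http://bangalore.olx.in/login.php"""

hosts = HOST.findall(urls)
print(hosts)
```

Editors such as Notepad++ use a different regex engine, so the pattern may need adjusting there.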

Binoy Dalal

For the links that start with protocol, you can use the following regex:

(?<=://)[\w-]+(?:\.[\w-]+)+\b

See demo

The (?<=://) look-behind ensures that :// immediately precedes the value we want to match. The match itself consists of runs of one or more word characters or hyphens ([\w-]+) separated by periods, with at least one period required.
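To see the pattern at work outside an editor, here is a small Python sketch that runs it over some of the sample URLs from the question. The variable names are assumptions of this example. The pattern has no capturing groups, so the extracted value is the whole match.

```python
import re

# The regex from the answer, unchanged.
DOMAIN = re.compile(r"(?<=://)[\w-]+(?:\.[\w-]+)+\b")

# Sample lines taken from the question.
text = "\n".join([
    "https://appspace.com",
    "http://birthdayshoes.com/forum/index.php",
    "http://birthdayshoes.com/forum/register/",
    "http://forums.virtualbox.org/ucp.php",
])

# With no capturing groups, findall returns each whole match.
domains = DOMAIN.findall(text)
print(domains)
```

Because the look-behind requires ://, lines without a protocol produce no match with this pattern.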

Wiktor Stribiżew
  • Thank you. Just one thing. With the other one `^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$` I was able to do Find/Replace using \1 \2. Here, I can't pull anything with \0, \1 or \2. What am I missing? – Alex S Aug 20 '15 at 14:08
  • Check this one: [`^(?:https?://)?([^/.\n]+(?=\.))?(\.?[^/.\n]+\.[^/\n]+)/?(.*)$`](https://regex101.com/r/kB2bM9/2). Replace with `\1\2` or `$1$2`. – Wiktor Stribiżew Aug 20 '15 at 14:12
  • Is there anything else you need to match in that document of yours? :) – Wiktor Stribiżew Aug 21 '15 at 06:14
  • I doubt it. I'm just testing it out today on several data logs. I think it should be a success. I will mark your answer as accepted once that's finished. I doubt there will be any more bugs. Thanks again. You will see a confirmation from my end as soon as it's over. – Alex S Aug 21 '15 at 06:24
  • Found this little nugget as well http://stackoverflow.com/questions/25703360/regular-expression-extract-subdomain-domain?rq=1 – Alex S Aug 21 '15 at 06:50
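
The find-and-replace suggested in the comments can also be tried outside the editor. The following is a minimal Python sketch, assuming Python 3.5 or later (where an unmatched optional group is substituted as an empty string); the names `PATTERN` and `cleaned` are made up for this example.

```python
import re

# The pattern from the comment: group 1 is an optional leading label,
# group 2 is the rest of the host name, group 3 is the trailing path.
PATTERN = re.compile(
    r"^(?:https?://)?([^/.\n]+(?=\.))?(\.?[^/.\n]+\.[^/\n]+)/?(.*)$"
)

# Sample lines taken from the question.
lines = [
    "https://appspace.com",
    "http://ayurfit.ning.com/main/authorization/signIn",
    "http://bangalore.olx.in/login.php",
    "http://birthdayshoes.com/forum/register/",
    "http://forums.virtualbox.org/ucp.php",
]

# Replace each line with groups 1 and 2, as in the \1\2 replacement.
cleaned = [PATTERN.sub(r"\1\2", line) for line in lines]
print(cleaned)
```

For two-label hosts such as appspace.com, group 1 ends up empty and group 2 carries the whole host name, so the combined replacement still yields the full domain.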