I am trying to extract a list of domain names from an httrack data stream using grep. It is close to working, but the result also includes every subdomain.

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo "([0-9,a-z\.-]+)\.(com)"

Here is my current example result:

  • domain1.com
  • domain2.com
  • www.domain3.com
  • subdomain.domain4.com
  • whatever.domain5.com

Here is my desired example result:

  • domain1.com
  • domain2.com
  • domain3.com
  • domain4.com
  • domain5.com

Is there something I can add to this grep expression, or should I pipe the output to a sed expression that strips the subdomains? If so, how do I do that? I'm stuck. Any help is much appreciated.

Regards,

Wyatt

toolic
Wyatt Jackson

3 Answers

1

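You could drop the . from the bracket expression in the grep pattern, so that a match can no longer extend across the dots that separate subdomain labels. The following should work: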

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | 
grep -iEo '[[:alnum:]-]+\.(com|net|org)'
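
To see the effect without running httrack, here is a minimal sketch that feeds the same pattern a few sample lines. The sample text and the extract_domains helper name are made up for illustration, and sort -u is an addition to remove duplicates; neither comes from the original command.

```sh
# Made-up lines standing in for the httrack output.
sample='<a href="http://www.domain3.com/page">x</a>
<a href="https://subdomain.domain4.com/">y</a>
<img src="http://domain1.com/logo.png">
<a href="http://www.domain3.com/other">z</a>'

# Hypothetical helper: without "." in the bracket expression, a match can
# only cover the single label directly before the TLD, so "www." and
# "subdomain." are never part of the result.
extract_domains() {
  grep -iEo '[[:alnum:]-]+\.(com|net|org)' | sort -u
}

printf '%s\n' "$sample" | extract_domains
```

With this sample input the result is domain1.com, domain3.com and domain4.com, one per line. Note that the pattern still only keeps one label before the TLD, so a name such as example.com.au would be reported as example.com.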
iruvar
0

If you only need to match .com domains, the following will work. It skips the http:// or https:// prefix and any leading subdomains, and captures only the last label plus .com. As you can see, though, it only works for .com.

/(?:https?:\/\/[a-z0-9.-]*?)([a-zA-Z0-9-]*\.com)/

Example Dataset

http://www.ilovefreestuff.com/
https://test.ilovefreestuff.com/
https://test.sub.ilovefreestuff.com/

REGEX101

That being said, it is generally bad practice to parse or validate domain names with a regex, because there are many variants that can never be fully accounted for. The exception is when the matching conditions and the dataset are clearly defined and not all-encompassing. This post has more details on the process and covers a few more situations.
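
Since the question asks about sed, here is a rough sketch of the same .com-only idea as a sed substitution, run on the example dataset above. It is not the regex from this answer: it uses POSIX extended syntax (sed -E) rather than a lazy quantifier, assumes one URL per line, and the strip_subdomains name is hypothetical.

```sh
# The example dataset from this answer.
urls='http://www.ilovefreestuff.com/
https://test.ilovefreestuff.com/
https://test.sub.ilovefreestuff.com/'

# Hypothetical helper: drop the scheme and any leading subdomain labels,
# keep only the last label before ".com", and discard an optional path.
# With -n and the p flag, lines that do not match are not printed.
strip_subdomains() {
  sed -nE 's#^https?://([a-zA-Z0-9-]+\.)*([a-zA-Z0-9-]+\.com)(/.*)?$#\2#p'
}

printf '%s\n' "$urls" | strip_subdomains | sort -u
```

All three URLs reduce to ilovefreestuff.com, so after sort -u a single line is printed. The same caveat applies: this only covers the clearly defined .com case.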

Community
MattSizzle
0

I use the following command. Note that it keeps all domains and subdomains, and the sed step strips out IPv4 addresses:

grep -oE '[[:alnum:]_.-]+[.][[:alnum:]_.-]+' file_name | sed -re 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}//g' | sort -u > test.txt
Netwons