I'm trying to write a bash script that locates URLs in a text file (example.com, example.eu, etc.) and copies them to another text file using egrep. My current output gives me the URLs I want, but unfortunately also a lot that I don't want, such as 123.123 or example.3xx.

My script currently looks like this:

egrep -o '\w*\.[^\d\s]\w{2,3}\b' trace.txt > url.txt

I tried the pattern on some regex checker sites, and there it gives more correct matches than what I actually get from egrep.

Any help is appreciated.

Erik

2 Answers

If you know the domain suffixes, you can use a regex that looks for \.(com|eu|org) at the end of the name.
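
For example, assuming the only suffixes of interest are .com, .eu and .org (that list, and the \b word anchors, which are a GNU grep extension, are assumptions here), a sketch could be:

egrep -o '\b[[:alnum:]-]+(\.[[:alnum:]-]+)*\.(com|eu|org)\b' trace.txt > url.txt

The explicit suffix list rules out matches like 123.123 and example.3xx, and the trailing \b keeps -o from extracting, say, example.com out of example.community.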

tavanez
  • That would help indeed, but I'm unsure how many of my suffixes are real domain suffixes and how many are .png or similar. I considered downloading another text file with all supported domain suffixes and cross-referencing the two files, but that sounds like a hassle. – Erik Mar 06 '20 at 15:33

Based on https://stackoverflow.com/a/2183140/939457 (and https://www.rfc-editor.org/rfc/rfc2181#section-11), a domain name is a series of labels, each of which may contain any character except a dot, joined by dots. Since you only want names that end in a valid TLD, you can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt to generate a list of patterns, one per TLD (the sed turns a line such as COM into ([^.]{1,63}\.){1,4}COM, i.e. one to four labels followed by the TLD, matched case-insensitively thanks to -i):

grep -i -E -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | sed 's/^/([^.]{1,63}\\\.){1,4}/') <<'EOF'
aaa.ali.bab.yandex
fsfdsa.d.s
alpha flkafj
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
EOF

Result:

aaa.ali.bab.yandex
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com

Note: this is a memory killer; the example above took 2 GB, since the list of TLDs is huge. You might consider searching for a list of commonly used TLDs and using that instead.
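
To apply this to the question's trace.txt and copy only the matched URLs, -o can be combined with the same generated pattern file. The following is a sketch under a few assumptions: \b word anchors (a GNU grep extension) are added so a match cannot start or end mid-word (otherwise example.community would yield example.com), labels are additionally forbidden from containing whitespace, and grep -v '^#' drops the comment line at the top of the IANA file:

grep -i -E -o -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | grep -v '^#' | sed 's/^/\\b([^.[:space:]]{1,63}\\\.){1,4}/;s/$/\\b/') trace.txt > url.txt

If memory is a concern, the same command works with a small hand-picked list of TLDs in place of the IANA download.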

Sorin