I'm trying to write a bash script that locates URLs in a text file (example.com, example.eu, etc.) and copies them to another text file using egrep. My current output gives me the URLs I want, but unfortunately also a lot that I don't want, such as 123.123 or example.3xx.

My script currently looks like this:

egrep -o '\w*\.[^\d\s]\w{2,3}\b' trace.txt > url.txt

I tried the pattern on some regex checker sites, and there it gives more correct matches than what I actually get from egrep.

Any help is appreciated.

Erik

2 Answers

If you know the domain suffixes, you can use a regex that looks for \.(com|eu|org) at the end of the name.
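
For example, assuming the only suffixes of interest are .com, .eu and .org (that list, and the \b word anchors, which are a GNU grep extension, are assumptions here), a sketch could be:

egrep -o '\b[[:alnum:]-]+(\.[[:alnum:]-]+)*\.(com|eu|org)\b' trace.txt > url.txt

The explicit suffix list rules out matches like 123.123 and example.3xx, and the trailing \b keeps -o from extracting, say, example.com out of example.community.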

tavanez
  • That would help indeed, but I'm unsure how many of my suffixes are real domain suffixes and how many are .png or similar. I considered downloading another text file with all supported domain suffixes and cross-referencing the two files, but that sounds like a hassle. – Erik Mar 06 '20 at 15:33

Based on https://stackoverflow.com/a/2183140/939457 (and https://www.rfc-editor.org/rfc/rfc2181#section-11), a domain name is a series of labels, each of which may contain any character except a dot, joined by dots. Since you only want names that end in a valid TLD, you can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt to generate a list of patterns, one per TLD (the sed turns a line such as COM into ([^.]{1,63}\.){1,4}COM, i.e. one to four labels followed by the TLD, matched case-insensitively thanks to -i):

grep -i -E -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | sed 's/^/([^.]{1,63}\\\.){1,4}/') <<'EOF'
aaa.ali.bab.yandex
fsfdsa.d.s
alpha flkafj
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
EOF

Result:

aaa.ali.bab.yandex
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com

Note: this is a memory killer; the example above took 2 GB, since the list of TLDs is huge. You might consider searching for a list of commonly used TLDs and using that instead.
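
To apply this to the question's trace.txt and copy only the matched URLs, -o can be combined with the same generated pattern file. The following is a sketch under a few assumptions: \b word anchors (a GNU grep extension) are added so a match cannot start or end mid-word (otherwise example.community would yield example.com), labels are additionally forbidden from containing whitespace, and grep -v '^#' drops the comment line at the top of the IANA file:

grep -i -E -o -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | grep -v '^#' | sed 's/^/\\b([^.[:space:]]{1,63}\\\.){1,4}/;s/$/\\b/') trace.txt > url.txt

If memory is a concern, the same command works with a small hand-picked list of TLDs in place of the IANA download.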

Sorin