bash shell text manipulation: I can extract a domain from a URL, how would I extend this to also exclude ".com" or ".co.uk" etc

Question

"get a domain from a url" is quite a common question here on this site and the answer I have used for a long time is from this question:

How to extract domain name from url?

The most popular answer has a comment from user "sakumatto" with a version that also handles sub-domains:

echo http://www.test.example.com:3030/index.php | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | awk -F. '{print $(NF-1) "." $NF}'

How would I further extend this command to exclude ".com" or ".co.uk" etc.?

Insight:

I am writing a bash script for a great feature of Termux (a terminal emulator for Android), "termux-url-opener", which runs a script whenever you use the native Android "share" feature. Say I'm in the browser and GitHub wants me to log in: I press "share", select "Termux", Termux opens, runs the script, copies the password to my clipboard and closes, and I'm automatically back in the browser with my password ready to paste! It's very simple and uses pass (password-store) with the pass-clip extension, gnupg and pinentry. What I have so far works fine, but it's dumb: I would have to keep writing if/elif statements for every password I have in pass. I would like to automate this, and all I need is to cut ".com" or ".co.uk" etc.

Here is my script so far:

#!/data/data/com.termux/files/usr/bin/bash

URL="$1"
WEBSITE=$(echo "$URL" | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | awk -F. '{print $(NF-1) "."  $NF}')

if [[ $WEBSITE =~ "github" ]]; then
# website contains "github"
  pass -c github
elif [[ $WEBSITE =~ "codeberg" ]]; then
# website contains "codeberg"
  pass -c codeberg
else
# is another app or website, so list all passwords entries.
  pass clip --fzf
fi

My pass entries are just website names, e.g. "github" or "codeberg", so if I could cut the ".com" or ".co.uk" from the end, I could add something like:

PASSWORDS=$(pass ls)

Then I could check whether the name taken from "$1" (my shared URL) is listed in pass ls, which would save me from having to write:

elif [[ $WEBSITE =~ "codeberg" ]]; then

For every single entry in pass.
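The lookup described above could be sketched roughly as follows. It assumes that pass keeps each entry as NAME.gpg under $PASSWORD_STORE_DIR (falling back to ~/.password-store), so a file test can stand in for parsing the tree printed by pass ls. The names has_pass_entry and copy_password are made up for illustration.

```shell
#!/usr/bin/env bash

# has_pass_entry (hypothetical helper): succeeds if an entry file named
# "$1.gpg" exists in the password store directory.
has_pass_entry() {
  local store="${PASSWORD_STORE_DIR:-$HOME/.password-store}"
  [ -f "$store/$1.gpg" ]
}

# copy_password (hypothetical wrapper): copy the matching entry if there is
# one, otherwise fall back to the interactive list, as in the script above.
copy_password() {
  if has_pass_entry "$1"; then
    pass -c "$1"
  else
    pass clip --fzf
  fi
}
```

With something like this in place, the if/elif chain could shrink to a single call such as copy_password "$WEBSITE", provided $WEBSITE already holds the bare site name.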

Thank you! It's really appreciated!

Using both sed and awk in the same command is dirty but works very well :) For your question: since your variable comes in as a parameter, you can exclude the extensions you want right away with bash substitutions, by refilling it from a `grep -v`, or inside your sed or your awk ... just choose which you prefer to do that work – francois P, Sep 21 '20 at 20:52
Yes, I agree @francois P, sed and awk together in one command seems overkill, but like you said it is effective: over the last 2 years, no matter what I throw at it, it works perfectly, and it's not slow, so I have never tried to simplify it. Your suggestion makes sense to me, but after a couple of attempts I am unable to grasp it. Would you care to show an example? If so, I would be happy to mark it as accepted! Thank you for the suggestion though! :) – 5c0tt_b0t, Sep 21 '20 at 21:02

score 2 · Answer 1 · answered Sep 21 '20 at 21:34

I might be missing something, but why don't you just strip the offending TLDs from the hostname?

as in:

sed \
    -e "s|[^/]*//\([^@]*@\)\?\([^:/]*\).*|\2|" \
    -e 's|\.$||' \
    -e 's|\.com$||' \
    -e 's|\.co\.[a-zA-Z]*$||' \
    -e 's|.*\.\([^.]*\.[^.]*\)|\1|'

"s|[^/]*//\([^@]*@\)\?\([^:/]*\).*|\2|" - this is your original regex, but using | as the delimiter rather than / (so less escaping is needed)
's|\.$||' - drop any accidental trailing dot (example.com. is a valid hostname!)
's|\.com$||' - remove a trailing .com
's|\.co\.[a-zA-Z]*$||' - remove a trailing .co.uk, .co.nl, ...
's|.*\.\([^.]*\.[^.]*\)|\1|' - remove all components from the hostname except for the last two (this is basically your awk script)
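If the goal is a single bare name even when sub-domains are present (so that www.mail.example.com yields example), one possible variation is to strip the suffix in a single substitution and then keep only the last remaining label. This is a sketch rather than part of the commands above: the function name site_name is made up, the `\?` operator relies on GNU sed just like the original regex, and the "optional .co plus one alphabetic label" pattern is a rough heuristic, not a complete list of public suffixes.

```shell
#!/usr/bin/env bash

# site_name (hypothetical helper): print the bare site name for a URL.
site_name() {
  # 1. keep only the host name (the original regex)
  # 2. drop a trailing dot
  # 3. drop ".co.xx" or a plain alphabetic ".xx" suffix in one step, so that
  #    the label before ".co.uk" is not stripped by a second pass
  # 4. drop everything up to the last remaining dot (the sub-domains)
  printf '%s\n' "$1" | sed \
    -e 's|[^/]*//\([^@]*@\)\?\([^:/]*\).*|\2|' \
    -e 's|\.$||' \
    -e 's|\(\.co\)\?\.[a-zA-Z][a-zA-Z]*$||' \
    -e 's|.*\.||'
}
```

For example, site_name http://www.mail.example.com:3030/index.php prints example, and site_name https://www.example.co.uk/ prints example as well.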

answered Sep 21 '20 at 21:34

umläute


I have used your suggestion like so: ` ❯ echo http://www.mail.example.com:3030/index.php | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | sed -e 's|\.$||' | sed -e 's|\.com$||' | sed -e 's|\.co\.[a-zA-Z]*$||' | sed -e 's|.*\.\([^.]*\.[^.]*\)|\1|'` but this results in: `mail.example`, so it doesn't work with sub-domains. This is why the command I use has the awk-fu added, to handle sub-domains as well. – 5c0tt_b0t Sep 21 '20 at 22:32
So if I'm getting this correct, your suggestion for my script is to remove: `WEBSITE=$(echo "$URL" | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | awk -F. '{print $(NF-1) "." $NF}')` and use your suggestion instead, but how? I have tried using: `WEBSITE=$(echo "$1" | sed \ -e "s|[^/]*//\([^@]*@\)\?\([^:/]*\).*|\2|" \ -e 's|\.$||' \ -e 's|\.com$||' \ -e 's|\.co\.[a-zA-Z]*$||' \ -e 's|.*\.\([^.]*\.[^.]*\)|\1|')` but it didn't work; I'm guessing my syntax is wrong. This is why an example using what I already have (the script in my question) would be a better answer. – 5c0tt_b0t Sep 21 '20 at 22:45
You wrote *but this results in: `mail.example`*. Well, then you should actually specify what you expect in the first place. So far you haven't stated any specifications, and all expectations are implicit; make your expectations **explicit**, it really helps in solving problems. – umläute Sep 22 '20 at 06:06

score 0 · Answer 2 · edited Sep 23 '20 at 11:13

I propose a workaround with a very simple modification: add a grep command, like this:

WEBSITE=$(echo $1 | grep -vE ".com|.uk" | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | awk -F. '{print $(NF-1) "."  $NF}')
test -z $WEBSITE && exit 1 # if empty (.com or .uk generates an empty variable)

$ cat > toto

WEBSITE=$(echo $1 | grep -vE ".com|.uk" | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | awk -F. '{print $(NF-1) "."  $NF}')
test -z $WEBSITE && exit 1
echo $WEBSITE

With an example:

$ bash toto http://www.google.fr
google.fr
$ bash toto http://www.google.com
$ bash toto http://www.google.uk
$ bash toto http://www.google.gertrude
google.gertrude
$ rm toto
$

I used .uk in my example so do not just copy/paste the line.
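Keep in mind that grep -v filters out whole lines rather than removing part of a line, and that an unescaped dot in the pattern matches any character. A small sketch of the difference follows; both function names are made up for illustration, and the suffix list in the second one is only an example.

```shell
#!/usr/bin/env bash

# filter_loose (hypothetical): the grep from the commands above. The whole
# line is dropped when the pattern matches anywhere in it, and the unescaped
# dots match any character, so "git.company.it" is dropped as well.
filter_loose() {
  printf '%s\n' "$1" | grep -vE ".com|.uk"
}

# strip_suffix (hypothetical): edit the line instead of dropping it, with
# escaped dots and the match anchored at the end of the host name.
strip_suffix() {
  printf '%s\n' "$1" | sed -E 's/\.(com|co\.uk)$//'
}
```

Here filter_loose git.company.it prints nothing, while strip_suffix git.company.it leaves the name untouched and strip_suffix example.co.uk prints example.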

edited Sep 23 '20 at 11:13

user3439894


answered Sep 21 '20 at 21:08

francois P


`grep -vE ".com|.uk"` will also match (and thus exclude) `git.company.it`, or even `https://github.com:7777/trash/smash.ukip` – umläute Sep 21 '20 at 21:42
It's a workaround, not a final solution :) Add ` grep -vE ".com$|.com\/ ....` and so on to match only site.com/something or .comENDOFLINE, and do the same for .co.uk. ;) – francois P Sep 21 '20 at 21:45
Using `grep -vE ".com|.uk"` like so: `echo http://www.mail.example.com:3030/index.php | grep -vE ".com|.uk" | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | awk -F. '{print $(NF-1) "." $NF}'` does not work. – 5c0tt_b0t Sep 21 '20 at 22:41
even: `echo http://www.mail.example.com:3030/index.php | grep -vE ".com|.uk"` does not show anything. – 5c0tt_b0t Sep 21 '20 at 22:45

score 0 · Answer 3 · answered Sep 22 '20 at 06:04

How about doing it entirely within bash:

if [[ $WEBSITE =~ ^(.*)([.]co)[.][a-z]+$ || $WEBSITE =~ ^(.*)[.][a-z]+$ ]]
then
  pass=${BASH_REMATCH[1]}
else
  echo "WARNING: Unexpected value for WEBSITE: $WEBSITE" >&2
  pass=$WEBSITE # Fallback
fi

I used two clauses (one for the .co case and one for the other cases) because bash regular expressions do not support non-greedy matching (i.e. .*?).
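Note that if $WEBSITE comes from the awk step in the question, only the last two labels are left (co.uk for www.example.co.uk), so the .co clause cannot match. One way around this, sketched below under that assumption, is to run the same two clauses on the full host name, extracted with parameter expansion, and to keep the last remaining label afterwards. The function name url_to_name is made up for illustration, and the [a-z] classes assume a lower-case host name.

```shell
#!/usr/bin/env bash

# url_to_name (hypothetical helper): print the bare site name for a URL,
# using only bash built-ins.
url_to_name() {
  local host="$1"
  host=${host#*://}   # drop the scheme, e.g. "http://"
  host=${host%%/*}    # drop the path
  host=${host##*@}    # drop any "user@" or "user:password@" prefix
  host=${host%%:*}    # drop the port
  host=${host%.}      # drop one trailing dot, if any

  # Same two clauses as above: the ".co.xx" case first, then a plain TLD.
  if [[ $host =~ ^(.*)[.]co[.][a-z]+$ || $host =~ ^(.*)[.][a-z]+$ ]]; then
    host=${BASH_REMATCH[1]}
  fi

  printf '%s\n' "${host##*.}"   # keep only the last remaining label
}
```

With this sketch, url_to_name http://www.test.example.com:3030/index.php prints example, and url_to_name https://www.example.co.uk/ prints example too.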
