62

How do I extract the domain name from a URL using Bash? For example: http://example.com/ should become example.com. It must work for any TLD, not just .com.

jww
  • 97,681
  • 90
  • 411
  • 885
Ben Smith
  • 621
  • 1
  • 5
  • 3
  • Dup: http://stackoverflow.com/questions/827024/how-do-i-extract-the-domain-out-of-an-url – Dennis Williamson Mar 23 '10 at 07:04
  • That is Perl, not Bash, though. –  Apr 22 '10 at 00:34
  • Basically all of the answers here are broken, except bewilderingly the Ruby one. You need to know the subdomain policy of the top-level domain before you can decide which is the root domain. Look for the Public Suffix database. In very brief, you want to handle cases like `www.surrey.bbc.co.uk`, `www.nic.ad.jp`, `www.city.nagoya.jp`, etc. – tripleee Nov 14 '22 at 13:20
  • @tripleee: Posted today a [pure bash answer](https://stackoverflow.com/a/74948263/1765658) with a chapter addressing your comment! – F. Hauri - Give Up GitHub Dec 29 '22 at 10:33

16 Answers

99

You can use a simple AWK one-liner to extract the domain name, as follows:

echo http://example.com/index.php | awk -F[/:] '{print $4}'

OUTPUT: example.com

:-)

sorin
  • 161,544
  • 178
  • 535
  • 806
Soj
  • 1,015
  • 7
  • 2
  • Nice, this is so much better than the answers provided in https://stackoverflow.com/questions/6174220/parse-url-in-shell-script ! – bk138 Dec 06 '14 at 01:19
  • 11
    `echo http://example.com:3030/index.php | awk -F/ '{print $3}'` `example.com:3030` :-( – Ben Burns Mar 24 '15 at 09:16
  • you could split on `:` again to get it, but it's not flexible enough to accept both with and without a port. – chovy Dec 29 '15 at 03:30
  • | awk -F/ '{print $3}' | awk -F: '{print $1}' – Andrew Mackenzie Mar 15 '16 at 12:34
  • What if I need this - http(s)://example.com? I tried printing $1$3 and it gives http:example.com (missing '//' after http); any idea? – 3AK Jun 16 '16 at 05:14
  • 3
    I got it by using this: `echo http://www.example.com/somedir/someotherdir/index.html | cut -d'/' -f1,2,3` gives `http://www.example.com` – 3AK Jun 16 '16 at 05:44
  • 7
    To handle urls with and without ports: `awk -F[/:] '{print $4}'` – Michael Oct 06 '17 at 14:35
  • @Michael If I also want to remove www but not any other subdomain (e.g., www.example.com -> example.com but home.example.com -> home.example.com)? – d-b Jun 13 '18 at 06:15
  • On MacOS it makes sense to do this: `echo http://example.com/index.php | awk -F/ '{print $3}' | awk -F: '{print $1}'` – derFunk Aug 21 '18 at 15:46
  • In case the URL contains `&`, wrap it in quotes when passing it as the parameter. – Vishrant Oct 09 '20 at 01:46
  • This does not work without http or https; for example, example.com/index.php/test would return blank. – MaXi32 Jul 31 '21 at 11:06
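
Pulling together the fixes from the comment thread above, a minimal sketch (assuming GNU awk and sed) that tolerates a port and strips only a leading www.:

# -F[/:] splits on both "/" and ":", so $4 is the bare host with or without a port;
# the sed step removes a leading "www." only, so home.example.com stays untouched.
echo "http://www.example.com:3030/index.php" | awk -F[/:] '{print $4}' | sed 's/^www\.//'

OUTPUT: example.com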
36
$ URI="http://user:pw@example.com:80/"
$ echo $URI | sed -e 's/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/'
example.com

see http://en.wikipedia.org/wiki/URI_scheme

Flimm
  • 136,138
  • 45
  • 251
  • 267
user300653
  • 487
  • 4
  • 3
  • 3
    This works with or without port, deep paths and is still using bash. although it doesn't work on mac. – chovy Dec 29 '15 at 03:34
  • 7 years later, this is still my go-to answer. – mwoodman Oct 19 '17 at 17:14
  • 2
    I use your suggestion with a little extra to strip out any subdomains that might be in the url ->> `echo http://www.mail.example.com:3030/index.php | sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/" | awk -F. '{print $(NF-1) "." $NF}'` so I basically cut your output at the dot and take the last & second to last column and patch them back with the dot. – sakumatto Nov 01 '17 at 14:33
  • **This is the best answer!** I used this for a ping command that allows full URLs: https://unix.stackexchange.com/a/428990/20661 stripping only the `www.` subdomain – rubo77 Mar 08 '18 at 10:52
  • 1
    For those who want to get the port: `sed -e "s/[^/]*\/\/\([^@]*@\)\?\([^:/]*\)\(:\([0-9]\{1,5\}\)\)\?.*/\4/"` – wheeler Apr 26 '18 at 23:38
  • 1
    @sakumatto works fine, but how would it be to support "https://example.com.uk" for example? – sanNeck Apr 15 '21 at 17:11
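
For reuse, the same sed expression can be wrapped in a small function (the name url_host is my own, not from the answer):

# Hypothetical helper around the sed expression above
url_host() {
    sed -e 's/[^/]*\/\/\([^@]*@\)\?\([^:/]*\).*/\2/' <<< "$1"
}

url_host "http://user:pw@example.com:80/"   # -> example.com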
30
basename "http://example.com"

Now of course, this won't work with a URI like this: http://www.example.com/index.html, but you could do the following:

basename $(dirname "http://www.example.com/index.html")

Or for more complex URIs:

echo "http://www.example.com/somedir/someotherdir/index.html" | cut -d'/' -f3

-d means "delimiter" and -f means "field"; in the above example, the third field delimited by the forward slash '/' is www.example.com.

musashiXXX
  • 4,192
  • 4
  • 22
  • 24
  • 5
    I like cut -d'/' -f3 for its simplicity. – Jamie Kitson Mar 14 '12 at 13:40
  • 1
    fails if you add a port: `echo "http://www.example.com:8080/somedir/someotherdir/index.html" | cut -d'/' -f3` – chovy Dec 29 '15 at 03:31
  • got this: `http://www.example.com` by running `echo http://www.example.com/somedir/someotherdir/index.html | cut -d'/' -f1,2,3` – 3AK Jun 16 '16 at 05:49
  • `basename $(dirname` does not work, if the url ends with the domain like: `basename $(dirname "http://www.example.com/")` will show just: `http:` – rubo77 Mar 08 '18 at 10:37
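
As a side note on the trailing-slash comment above: the cut variant is immune to that problem, since field 3 is the host either way:

echo "http://www.example.com/" | cut -d'/' -f3
www.example.com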
18
echo $URL | cut -d'/' -f3 | cut -d':' -f1

Works for URLs:

http://host.example.com
http://host.example.com/hi/there
http://host.example.com:2345/hi/there
http://host.example.com:2345
keyoxy
  • 4,423
  • 2
  • 21
  • 18
  • 1
    I found this more useful as it would return the url as it is when it doesn't contain 'http://' i.e. `abc.com` will be retained as `abc.com` – Udayraj Deshmukh Nov 05 '18 at 08:16
  • This is in fact the most intuitive, concise and effective method of all the answers here! – Robert Aug 15 '21 at 14:22
  • 1
    This extracts `host.example.com` rather than the domain name (`example.com`) asked for. – Lucas Apr 05 '22 at 19:19
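
If you need the registrable domain rather than the full host, a naive sketch (adapted from the comments on the accepted answer) keeps only the last two labels; note that this mangles multi-label public suffixes such as co.uk (see the comments on the question):

echo "http://host.example.com:2345/hi/there" | cut -d'/' -f3 | cut -d':' -f1 | awk -F. '{print $(NF-1) "." $NF}'
example.com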
11
sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_'

e.g.

$ sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' <<< 'http://example.com'
example.com

$ sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' <<< 'https://example.com'
example.com

$ sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' <<< 'http://example.com:1234/some/path'
example.com

$ sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' <<< 'http://user:pass@example.com:1234/some/path'
example.com

$ sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' <<< 'http://user:pass@example.com:1234/some/path#fragment'
example.com

$ sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' <<< 'http://user:pass@example.com:1234/some/path#fragment?params=true'
example.com
Armand
  • 23,463
  • 20
  • 90
  • 119
  • Boom! `HOST=$(sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' <<< "$MYURL")` is fine in Bash – 4Z4T4R May 26 '17 at 17:58
  • I would like to crop www from domain. In this case, how should I change the command properly? – Ceylan B. Apr 25 '19 at 08:22
  • thanks for this, very handy. To capture the path from the URL I extend it slightly with a third group: `sed -E -e 's_.*://([^/@]*@)?([^/:]+)(.*)_\3_' <<< 'http://example.com/path/to/something'` prints `/path/to/something`. – Max Barrass May 05 '22 at 03:53
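
Answering the www question in the comments, a hedged tweak of the same expression that optionally swallows a leading www. (the host moves to group \3):

$ sed -E -e 's_.*://([^/@]*@)?(www\.)?([^/:]+).*_\3_' <<< 'http://www.example.com/index.html'
example.com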
7
#!/usr/bin/perl -w
use strict;

my $url = $ARGV[0];

if($url =~ /([^:]*:\/\/)?([^\/]+\.[^\/]+)/g) {
  print $2;
}

Usage:

./test.pl 'https://example.com'
example.com

./test.pl 'https://www.example.com/'
www.example.com

./test.pl 'example.org/'
example.org

./test.pl 'example.org'
example.org

./test.pl 'example'  -> no output

And if you just want the domain and not the full host + domain use this instead:

#!/usr/bin/perl -w
use strict;

my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+\.[^\/]+)/g) {
  print $3;
}
Dark Castle
  • 1,289
  • 2
  • 9
  • 20
  • Of course the last one doesn't know about "www.example.co.uk" http://search.cpan.org/~nmelnick/Domain-PublicSuffix-0.04/lib/Domain/PublicSuffix.pm – Dennis Williamson Mar 23 '10 at 07:03
  • True, and if there is an API for it obviously I'd go with that anyway. Seems like the complete solution would actually have to know all valid country codes and check to see if the last post-dot region was a country code... – Dark Castle Mar 23 '10 at 13:56
6

Instead of using regex to do this you can use python's urlparse:

 URL=http://www.example.com

 python -c "from urlparse import urlparse
 url = urlparse('$URL')
 print url.netloc"

You could either use it like this or put it in a small script. However, this still expects a valid scheme identifier; looking at your comment, your input doesn't necessarily provide one. You can specify a default scheme, but urlparse expects the netloc to start with '//':

url = urlparse('//www.example.com/index.html','http')

So you will have to prepend those manually, i.e:

 python -c "from urlparse import urlparse
 if '$URL'.find('://') == -1:
   url = urlparse('//$URL','http')
 else:
   url = urlparse('$URL')
 print url.netloc"
Garns
  • 416
  • 3
  • 4
4

3 answers: short URL parsing (POSIX, then bash) and a full TLD extractor

Remark about the question:

The question is tagged bash, but the goal here is really just to split a string on the / character! Using regex for this kind of job is overkill: a classic XY problem.

POSIX shell first

Instead of forking other binaries like awk, perl or cut, we can use parameter expansion, which is quicker:

URL="http://example.com/some/path/to/page.html"
prot="${URL%%:*}"
link="${URL#$prot://}"
domain="${link%%/*}"
link="${link#$domain}"
printf '%-8s: %s\n' Protocol "${prot%:}" Domain "$domain" Link "$link"
Protocol: http
Domain  : example.com
Link    : /some/path/to/page.html

Note: This works even with file:// URLs:

URL=file:///tmp/so/test.xml 
prot="${URL%%:*}"
link="${URL#$prot://}"
domain="${link%%/*}"
link="${link#$domain}"
printf '%-8s: %s\n' Protocol "${prot%:}" Domain "$domain" Link "$link"
Protocol: file
Domain  : 
Link    : /tmp/so/test.xml

Read URL parts using bash

As this question is tagged bash and no other answer addresses read, here is a short, quick and reliable solution:

URL="http://example.com/some/path/to/page.html"

IFS=/ read -r prot _ domain link <<<"$URL"

That's all. As read is a shell builtin, this is the quickest way! (See the comment below for a function variant.)

From there you could

printf '%-8s: %s\n' Protocol "${prot%:}" Domain "$domain" Link "/$link"
Protocol: http
Domain  : example.com
Link    : /some/path/to/page.html

You could even check for port:

URL="http://example.com:8000/some/path/to/page.html"
IFS=/ read -r prot _ domain link <<<"$URL"
IFS=: read -r domain port <<<"$domain"

printf '%-8s: %s\n' Protocol "${prot%:}" Domain "$domain" Port "$port" Link "/$link"
Protocol: http
Domain  : example.com
Port    : 8000
Link    : /some/path/to/page.html

Full parsing with default ports:

URL="https://stackoverflow.com/questions/2497215/how-to-extract-domain-name-from-url"
declare -A DEFPORTS='([http]=80 [https]=443 [ipp]=631 [ftp]=21)'
IFS=/ read -r prot _ domain link <<<"$URL"
IFS=: read -r domain port <<<"$domain"

printf '%-8s: %s\n' Protocol "${prot%:}" Domain "$domain" \
    Port  "${port:-${DEFPORTS[${prot%:}]}}" Link "/$link"
Protocol: https
Domain  : stackoverflow.com
Port    : 443
Link    : /questions/2497215/how-to-extract-domain-name-from-url

Full Top Level Domain extractor (in pure bash):

Regarding the public suffix list and @tripleee's comment:

There is a single fork, to wget, done only once at function initialization:

# Associative array of known public suffixes, keyed by suffix
declare -A TLD='()'
initTld () {
    local tld
    # Keep only plain suffix entries: lines containing spaces, "/", ";" or "*"
    # (comments, blanks, wildcard rules) are skipped; the "!" prefix of
    # exception rules is stripped.
    while read -r tld; do
        [[ -n ${tld//*[ \/;*]*} ]] && TLD["${tld#\!}"]=''
    done < <(
      wget -qO - https://publicsuffix.org/list/public_suffix_list.dat
    )
}
tldExtract () {
    # Usage: tldExtract [-v outvar] fqdn
    if [[ $1 == -v ]] ;then local _tld_out_var=$2;shift 2;fi
    local dom tld=$1 _tld_out_var
    # Peel labels off the left until the rest is a known public suffix
    while [[ ! -v TLD[${tld}] ]] && [[ -n $tld ]]; do
        IFS=. read -r dom tld <<< "$tld"
    done
    if [[ -v _tld_out_var ]] ;then
        printf -v $_tld_out_var '%s %s' "$dom" "$tld"
    else
        echo "$dom $tld"
    fi
}
initTld ; unset -f initTld  # fetch the list once, then drop the initializer

Then

tldExtract www.stackoverflow.com
stackoverflow com

tldExtract sub.www.test.co.uk
test co.uk

tldExtract -v myVar sub.www.test.co.uk
echo ${myVar% *}
test
echo ${myVar#* }
co.uk

tldExtract -v myVar www2.sub.city.nagoya.jp
echo $myVar 
sub city.nagoya.jp
F. Hauri - Give Up GitHub
  • 64,122
  • 17
  • 116
  • 137
  • Quicker function: `parseUrl() { local IFS=/ arry;arry=($4);printf -v $1 ${arry%:};printf -v $2 ${arry[2]};printf -v $3 "/${arry[*]:3}";}` to be used as a `read` replacement: `parseUrl prot domain link "$URL"` populates the `$prot`, `$domain` and `$link` variables – F. Hauri - Give Up GitHub Mar 23 '23 at 11:43
3

There is so little info on how you get those URLs... please show more info next time: are there parameters in the URL, etc.? Meanwhile, here is simple string manipulation for your sample URL

eg

$ s="http://example.com/index.php"
$ echo ${s%/*}  # get rid of last "/" onwards
http://example.com
$ s=${s%/*}
$ echo ${s/#http:\/\//} # get rid of http://
example.com

other ways, using sed (GNU)

$ echo $s | sed 's/http:\/\///;s|\/.*||'
example.com

use awk

$ echo $s| awk '{gsub("http://|/.*","")}1'
example.com
ghostdog74
  • 327,991
  • 56
  • 259
  • 343
  • Your method doesn't work! echo http://example.com/index.php | sed -r 's/http:\/\/|\///g' gives output example.comindex.php and NOT example.com on cygwin. please post a method that works – Ben Smith Mar 23 '10 at 03:11
  • 3
    my method doesn't work because your sample url is different !! and you did not provide more info on what type of urls you want to parse !!. you should write your question clearly providing input examples and describe what output you want next time! – ghostdog74 Mar 23 '10 at 03:31
  • 2nd line seems to be incorrect. I copypasted the 2 first lines to my ubuntu shell and got _http://example.com/index.php*_ – jpeltoniemi Jun 25 '12 at 16:58
3

The following will output "example.com":

URI="http://user@example.com/foo/bar/baz/?lala=foo" 
ruby -ruri -e "p URI.parse('$URI').host"

For more info on what you can do with Ruby's URI class you'd have to consult the docs.

Michael Kohl
  • 66,324
  • 14
  • 138
  • 158
1

One solution that covers more cases would be based on sed regexps:

echo http://example.com/index.php | sed -e 's#^https://\|^http://##' -e 's#:.*##' -e 's#/.*##'

That would work for URLs like: http://example.com/index.php, http://example.com:4040/index.php, https://example.com/index.php

1

Please note that extracting only the domain name from a URL is a bit tricky, because the domain name's place in the hostname depends on the country (or more generally on the TLD) being used.

e.g. for Argentina: in www.personal.com.ar the domain name is personal.com.ar, not com.ar, because this TLD uses subzones to specify the type of organization.

The tool that I've found to manage well these cases is tldextract
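
tldextract is a Python package; assuming pip is available, it can be installed with:

pip install tldextract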

So based on the FQDN (host part of the URL), you would get the domain reliably this way:

tldextract personal.com.ar | cut -d " " -f 2,3 | sed 's/ /./'

(the other answers to get the FQDN out of the URL are good and should be used)

Hope this helps :) and thanks to tripleee!

Phil L.
  • 2,637
  • 1
  • 17
  • 11
  • There is no "above" or "below"; your answer could be first or last or in the middle depending on each visitor's display preferences. Also, this is not a "corner case" but rather the central case; the popular global TLDs where extraction is simple are actually the corner case. Nevertheless, +1 – tripleee Dec 29 '22 at 11:05
0

With Ruby you can use the Domainatrix library / gem

http://www.pauldix.net/2009/12/parse-domains-from-urls-easily-with-domainatrix.html

require 'rubygems'
require 'domainatrix'
s = 'http://www.champa.kku.ac.th/dir1/dir2/file?option1&option2'
url = Domainatrix.parse(s)
url.domain
=> "kku"

great tool! :-)

Tilo
  • 1
0

Here's the node.js way, it works with or without ports and deep paths:

//get-hostname.js
'use strict';

const url = require('url');
const parts = url.parse(process.argv[2]);

console.log(parts.hostname);

Can be called like:

node get-hostname.js http://foo.example.com:8080/test/1/2/3.html
//foo.example.com

Docs: https://nodejs.org/api/url.html
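
On current Node versions, the WHATWG URL class does the same without url.parse (a hedged one-liner; with -e, process.argv[1] is the first trailing argument):

node -e 'console.log(new URL(process.argv[1]).hostname)' 'http://foo.example.com:8080/test/1/2/3.html'
//foo.example.com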

chovy
  • 72,281
  • 52
  • 227
  • 295
0

Pure Bash implementation without any sub-shell or sub-process:

# Extract host from a URL
#   $1: URL
shopt -s extglob # required for the +( ) extended pattern below
function extractHost {
    local s="$1"
    s="${s/#*:\/\/}"          # strip scheme prefix (parameter expansion and pattern matching)
    echo -n "${s/%+(:*|\/*)}" # strip port and/or path suffix
}

E.g. extractHost "docker://1.2.3.4:1234/a/v/c" will output 1.2.3.4

vbem
  • 2,115
  • 2
  • 12
  • 9
0

Using bash built-in regex (no external utilities needed):

#!/usr/bin/env bash

url=https://stackoverflow.com/questions/2497215/how-to-extract-domain-name-from-url

if [[ $url =~ ^(https?://[^/]+) ]]; then
  host="${BASH_REMATCH[1]}"
  echo "HOST: $host"
else
  echo "Invalid URL $url"
  exit 1
fi

# OUTPUT
# HOST: https://stackoverflow.com
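
To capture only the hostname rather than scheme plus host, a minor hedged variation of the same pattern:

if [[ $url =~ ^https?://([^/:]+) ]]; then
  echo "HOST: ${BASH_REMATCH[1]}"
fi

# OUTPUT
# HOST: stackoverflow.com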
ccpizza
  • 28,968
  • 18
  • 162
  • 169