19

I need to go through a large list of URL strings and extract the domain name from each of them.

For example:

http://www.stackoverflow.com/questions would extract www.stackoverflow.com

I originally was using new URL(theUrlString).getHost() but the URL object initialization adds a lot of time to the process and seems unneeded.

Is there a faster method to extract the host name that would be as reliable?

Thanks

Edit: My mistake, yes the www. would be included in domain name example above. Also, these urls may be http or https

cottonBallPaws
  • You could use a regular expression or some simple string manipulation to extract it, i.e. remove the leading `http://` or `https://` and then take everything up to the first `/` or `:` (port - not sure if you want this). However I'm not sure if this covers all cases (hence the comment rather than answer) – John Pickup Jan 28 '11 at 08:09

8 Answers

40

If you want to handle https etc, I suggest you do something like this:

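int slashslash = url.indexOf("//") + 2;
int end = url.indexOf('/', slashslash);
// If there is no '/' after the host, take the rest of the string.
domain = url.substring(slashslash, end >= 0 ? end : url.length());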

Note that this includes the www part (just as URL.getHost() would do), which is actually part of the domain name.

Edit Requested via comments

Here are two methods that might be helpful:

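/**
 * Takes a url such as http://www.stackoverflow.com and returns www.stackoverflow.com.
 * Any port number is dropped.
 *
 * @param url the url string; may be null or empty
 * @return the host part, or "" if url is null or empty
 */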
public static String getHost(String url){
    if(url == null || url.length() == 0)
        return "";

    int doubleslash = url.indexOf("//");
    if(doubleslash == -1)
        doubleslash = 0;
    else
        doubleslash += 2;

    int end = url.indexOf('/', doubleslash);
    end = end >= 0 ? end : url.length();

    int port = url.indexOf(':', doubleslash);
    end = (port > 0 && port < end) ? port : end;

    return url.substring(doubleslash, end);
}


/**  Based on : http://grepcode.com/file/repository.grepcode.com/java/ext/com.google.android/android/2.3.3_r1/android/webkit/CookieManager.java#CookieManager.getBaseDomain%28java.lang.String%29
 * Get the base domain for a given host or url. E.g. mail.google.com will return google.com
 * @param url a host name or url
 * @return the last two labels of the host (the whole host if it has fewer)
 */
public static String getBaseDomain(String url) {
    String host = getHost(url);

    int startIndex = 0;
    int nextIndex = host.indexOf('.');
    int lastIndex = host.lastIndexOf('.');
    while (nextIndex < lastIndex) {
        startIndex = nextIndex + 1;
        nextIndex = host.indexOf('.', startIndex);
    }
    if (startIndex > 0) {
        return host.substring(startIndex);
    } else {
        return host;
    }
}
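
As a comment on this answer points out, taking the last two labels breaks on hosts such as amazon.co.uk, where the public suffix itself has two labels. A minimal sketch of one way around that is shown here. The class name and the tiny suffix set are hypothetical and only for illustration; real code would need a complete list such as the Public Suffix List. It expects a bare host name, for example the result of getHost() above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class BaseDomainSketch {
    // Hypothetical, deliberately tiny sample of two-label public suffixes.
    // A real implementation would load the full Public Suffix List instead.
    private static final Set<String> MULTI_LABEL_SUFFIXES =
            new HashSet<>(Arrays.asList("co.uk", "org.uk", "com.au", "co.jp"));

    /** Returns the base domain of a bare host name, e.g. amazon.co.uk. */
    static String baseDomain(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        int n = labels.length;
        if (n <= 2) {
            return lower; // nothing to strip
        }
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        if (MULTI_LABEL_SUFFIXES.contains(lastTwo)) {
            // Keep one extra label in front of a two-label suffix.
            return labels[n - 3] + "." + lastTwo;
        }
        return lastTwo;
    }
}
```

With this sketch, BaseDomainSketch.baseDomain("www.amazon.co.uk") gives amazon.co.uk, while mail.google.com still gives google.com.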
extraneon
aioobe
  • This is nearly 12x as fast as the new URL().getHost() method I was using and it appears to work just as well. I am going to try to get a good regex pattern and run some more benchmarks before I mark this as answered, but that is a good improvement! – cottonBallPaws Jan 28 '11 at 17:57
  • Also, I had to change your code slightly to add a check in case the trailing / did not exist. – cottonBallPaws Jan 28 '11 at 17:59
  • Ah, yes, good point! Feel free to edit my answer and include your improvements! – aioobe Jan 28 '11 at 18:01
  • littleFluffyKitty, it would have been nice if you updated his answer. – slashline Aug 25 '11 at 14:55
  • Fails for the URL reference `#//x/` and the URLs `.//x/` and `?x=//y/`. – Mike Samuel Mar 11 '12 at 00:03
  • Stumbled upon this via Google. While evaluating these code snippets I figured out that port definitions in the URL aren't handled correctly: http://localhost:4394/ will result in "localhost:4394" in getHost() – gue Dec 15 '13 at 20:34
  • Port is now also removed from the host url. And thanks for the code; it works like a charm in my junit tests and it is also easy to understand. – extraneon Jan 12 '14 at 10:18
  • @aioobe getBaseDomain fails on urls like amazon.co.uk – gladiator Jan 21 '15 at 10:45
9

You want to be rather careful about implementing a "fast" way of unpicking URLs. There is a lot of potential variability in URLs that could cause a "fast" method to fail. For example:

  • The scheme (protocol) part can be written in any combination of upper and lower case letters; e.g. "http", "Http" and "HTTP" are equivalent.

  • The authority part can optionally include a user name and / or a port number as in "http://you@example.com:8080/index.html".

  • Since DNS is case insensitive, the hostname part of a URL is also (effectively) case insensitive.

  • It is legal (though highly irregular) to %-encode unreserved characters in the scheme or authority components of a URL. You need to take this into account when matching (or stripping) the scheme, or when interpreting the hostname. A hostname with %-encoded characters is defined to be equivalent to one with the %-encoded sequences decoded.

Now, if you have total control of the process that generates the URLs you are stripping, you can probably ignore these niceties. But if they are harvested from documents or web pages, or entered by humans, you would be well advised to consider what might happen if your code encounters an "unusual" URL.
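
To make some of the points above concrete, here is a rough sketch of a string-based extractor that skips the user name, strips the port and lower-cases the host. The class and method names are made up for illustration. It assumes the input is either an absolute URL containing "://" or a bare "host/path" string, and it does not decode %-encoded characters or handle IPv6 literals, so treat it as a starting point rather than a complete parser.

```java
import java.util.Locale;

final class AuthorityHostSketch {
    /** Returns the lower-cased host, or null if no host can be found. */
    static String host(String url) {
        if (url == null) {
            return null;
        }
        // Assumption: anything before "://" is the scheme, whatever its case.
        int schemeEnd = url.indexOf("://");
        int start = schemeEnd >= 0 ? schemeEnd + 3 : 0;

        // The authority ends at the first '/', '?' or '#'.
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                end = i;
                break;
            }
        }

        // Skip optional user info such as "you@" or "you:secret@".
        int at = url.lastIndexOf('@', end - 1);
        if (at >= start) {
            start = at + 1;
        }

        // Strip an optional ":port".
        int colon = url.indexOf(':', start);
        if (colon >= 0 && colon < end) {
            end = colon;
        }

        return start < end
                ? url.substring(start, end).toLowerCase(Locale.ROOT)
                : null;
    }
}
```

For "http://you@example.com:8080/index.html" this returns example.com, and "HTTP://WWW.StackOverflow.COM" comes back as www.stackoverflow.com.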


If your concern is the time taken to construct URL objects, consider using URI objects instead. Among other good things, URI objects don't attempt a DNS lookup of the hostname part.
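
A minimal sketch of the URI approach might look like this. The wrapper class and method are hypothetical; only java.net.URI and its getHost() method come from the JDK. Returning null for unparseable input is a design choice made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriHostSketch {
    /** Returns the host as parsed by java.net.URI, or null if there is none. */
    static String host(String url) {
        try {
            // getHost() already excludes any user info and port.
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            // The string is not a syntactically valid URI (e.g. contains a space).
            return null;
        }
    }
}
```

UriHostSketch.host("http://you@example.com:8080/index.html") returns example.com. Note that getHost() returns null when the string has no scheme and authority, for example "www.stackoverflow.com/questions".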

Stephen C
2

I wrote a method (see below) which extracts a url's domain name and which uses simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there'll be cases where this won't be good enough but it should be good enough in most cases!

I read here that the java.net.URI class could do this (and was preferred to the java.net.URL class) but I encountered problems with the URI class. Notably, URI.getHost() gives a null value if the url does not include the scheme, i.e. the "http(s)" bit.

/**
 * Extracts the domain name from {@code url}
 * by means of String manipulation
 * rather than using the {@link URI} or {@link URL} class.
 *
 * @param url is non-null.
 * @return the domain name within {@code url}.
 */
public String getUrlDomainName(String url) {
  String domainName = url;

  int index = domainName.indexOf("://");

  if (index != -1) {
    // keep everything after the "://"
    domainName = domainName.substring(index + 3);
  }

  index = domainName.indexOf('/');

  if (index != -1) {
    // keep everything before the '/'
    domainName = domainName.substring(0, index);
  }

  // check for and remove a preceding 'www'
  // followed by any sequence of characters (non-greedy)
  // followed by a '.'
  // from the beginning of the string
  domainName = domainName.replaceFirst("^www.*?\\.", "");

  return domainName;
}
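
For comparison, the null result from URI.getHost() on scheme-less input can be worked around by prepending a default scheme before parsing. This is only a sketch: the class name is made up, and it assumes that any input without "://" is a bare "host/path" string, which will not be true for every data set.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SchemelessHostSketch {
    /** Returns the host, tolerating input that lacks a scheme; null on failure. */
    static String host(String url) {
        // Assumption: no "://" means the string starts directly with the host.
        String candidate = url.contains("://") ? url : "http://" + url;
        try {
            return new URI(candidate).getHost();
        } catch (URISyntaxException e) {
            return null; // not a syntactically valid URI
        }
    }
}
```

With this, both "www.stackoverflow.com/questions" and "http://www.stackoverflow.com/questions" give www.stackoverflow.com. Unlike getUrlDomainName above, it does not chop off the leading www.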
Community
Adil Hussain
1

Try the getDomainFromUrl() method in this class:

package com.visc.mobilesecurity.childrencare.utils;

/**
 * Created by thongnv12 on 3/9/2018.
 */

public class ChildcareUtils {

    public static final String[] NATION_DOMAIN = {"af", "ax", "al", "dz", "as", "ad", "ao", "ai", "aq", "ag", "ar", "am", "aw", "ac", "au", "at", "az", "bs", "bh", "bd", "bb", "eus",
            "by", "be", "bz", "bj", "bm", "bt", "bo", "bq", "ba", "bw", "bv", "br", "io", "vg", "bn", "bg", "bf", "mm", "bi", "kh", "cm", "ca", "cv", "cat", "ky", "cf", "td", "cl",
            "cn", "cx", "cc", "co", "km", "cd", "cg", "ck", "cr", "ci", "hr", "cu", "cw", "cy", "cz", "dk", "dj", "dm", "do", "tl", "ec", "eg", "sv", "gq", "er", "ee", "et", "eu",
            "fk", "fo", "fm", "fj", "fi", "fr", "gf", "pf", "tf", "ga", "gal", "gm", "ps", "ge", "de", "gh", "gi", "gr", "gl", "gd", "gp", "gu", "gt", "gg", "gn", "gw", "gy", "ht",
            "hm", "hn", "hk", "hu", "is", "in", "id", "ir", "iq", "ie", "im", "il", "it", "jm", "jp", "je", "jo", "kz", "ke", "ki", "kw", "kg", "la", "lv", "lb", "ls", "lr", "ly",
            "li", "lt", "lu", "mo", "mk", "mg", "mw", "my", "mv", "ml", "mt", "mh", "mq", "mr", "mu", "yt", "mx", "md", "mc", "mn", "me", "ms", "ma", "mz", "mm", "na", "nr", "np",
            "nl", "nc", "nz", "ni", "ne", "ng", "nu", "nf", "kp", "mp", "no", "om", "pk", "pw", "ps", "pa", "pg", "py", "pe", "ph", "pn", "pl", "pt", "pr", "qa", "ro", "ru", "rw",
            "re", "bq", "bl", "sh", "kn", "lc", "mf", "fr", "pm", "vc", "ws", "sm", "st", "sa", "sn", "rs", "sc", "sl", "sg", "bq", "sx", "sk", "si", "sb", "so", "so", "za", "gs",
            "kr", "ss", "es", "lk", "sd", "sr", "sj", "sz", "se", "ch", "sy", "tw", "tj", "tz", "th", "tg", "tk", "to", "tt", "tn", "tr", "tm", "tc", "tv", "ug", "ua", "ae", "uk",
            "us", "vi", "uy", "uz", "vu", "va", "ve", "vn", "wf", "eh", "zm", "zw"};


    public static boolean isInNationString(String str) {
        for (int index = 0; index < NATION_DOMAIN.length; index++) {
            if (NATION_DOMAIN[index].equals(str)) {
                return true;
            }
        }
        return false;
    }


    public static String getDomainFromUrl(String urlStr) {
        try {
            String result = null;
//            URL url = new URL(urlStr);
//            result = url.getHost();
//            return result;
//
            // for testing
            // check for spaces
            if (urlStr.contains(" ")) {
                return null;
            }
            // replace
            urlStr = urlStr.replace("https://", "");
            urlStr = urlStr.replace("http://", "");
            urlStr = urlStr.replace("www.", "");
            //
            String[] splitStr = urlStr.split("/");

            String domainFull = splitStr[0];

            String[] splitDot = domainFull.split("\\.");

            if (splitDot.length < 2) {
                return null;
            }

            String nationStr = splitDot[splitDot.length - 1];

            if (isInNationString(nationStr)) {
                if (splitDot.length < 4) {
                    result = domainFull;
                } else {
                    StringBuilder strResult = new StringBuilder();
                    int lengthDot = splitDot.length;
                    strResult.append(splitDot[lengthDot - 3]).append(".");
                    strResult.append(splitDot[lengthDot - 2]).append(".");
                    strResult.append(splitDot[lengthDot - 1]);
                    result = strResult.toString();
                }

            } else {
                if (splitDot.length < 3) {
                    result = domainFull;
                } else {
                    StringBuilder strResult = new StringBuilder();
                    int lengthDot = splitDot.length;
                    strResult.append(splitDot[lengthDot - 2]).append(".");
                    strResult.append(splitDot[lengthDot - 1]);
                    result = strResult.toString();
                }
            }
            return result;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }

    }
}
Mr.Thong
1

Here is another way to get the host:

private String getHostName(String hostname) {
    // to provide faultproof result, check if not null then return only hostname, without www.
    if (hostname != null) {
        return hostname.startsWith("www.") ? hostname.substring(4) : getHostNameDFExt(hostname);
    }
    return hostname;
}

private String getHostNameDFExt(String hostname) {

    int substringIndex = 0;
    for (char character : hostname.toCharArray()) {
        substringIndex++;
        if (character == '.') {
            break;
        }
    }

    return hostname.substring(substringIndex);

}

Now pass the host name to the function after extracting it from the URL:

URL url = new URL("https://www.facebook.com/");
String hostname = getHostName(url.getHost());

Toast.makeText(this, hostname, Toast.LENGTH_SHORT).show();

The output would be: "facebook.com"

Ali Azaz Alam
  • Why do you need `getHostNameDFExt`? Just return the `hostname`, symmetrically to what you do in the `hostname.startsWith("www.")` `true` case. – Johnny Jul 12 '20 at 11:17
0

You could write a regexp? http:// is always the same, and then match everything until you get the first '/'.
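
Something along these lines, as a rough sketch. The pattern and the class name are just one possible choice, not a tested standard. It allows any scheme (so https works too), skips an optional user name, and stops at the port, path, query or fragment. The pattern is compiled once and reused, since recompiling it for every URL in a large list would waste time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexHostSketch {
    // optional "scheme:", then "//", optional "userinfo@", then the host (group 1)
    private static final Pattern HOST = Pattern.compile(
            "^(?:[A-Za-z][A-Za-z0-9+.-]*:)?//(?:[^/?#@]*@)?([^/?#:]+)");

    /** Returns the host matched by the pattern, or null if the input has no "//". */
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }
}
```

RegexHostSketch.host("http://www.stackoverflow.com/questions") returns www.stackoverflow.com. Whether this beats plain indexOf/substring would have to be benchmarked.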

Nanne
  • A regular expression is *not* likely to be the fastest way. – aioobe Jan 28 '11 at 08:10
  • It might be faster than building the URL, though. Why do you think this? 'Gut feeling', or do you have any references? – Nanne Jan 28 '11 at 08:15
  • The reg-exp engine produces an FSM and updates some state for each character. I doubt that it's faster, but if you benchmark it, please let us know about your findings. – aioobe Jan 28 '11 at 08:20
0

Assuming that they're all well-formed URLs, but you don't know whether they'll be http://, https://, etc.:


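int start = theUrlString.indexOf('/');
// Move past the second '/' of "//" so start points at the first host character.
start = theUrlString.indexOf('/', start + 1) + 1;
int end = theUrlString.indexOf('/', start);
if (end == -1) end = theUrlString.length(); // no path after the host
String domain = theUrlString.substring(start, end);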

Jason LeBrun
0

You could try to use regular expressions.

http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

Here is a question about extracting domain name with regular expressions in Java:

Regular expression to retrieve domain.tld

Community
Martin