
Here I have a list of sites crawled from several different navigation sites, and some of them are duplicates. For example:

http://www.hao123.com/index.htm AND http://www.hao123.com

These are two URLs for the same site with the same content, and of course there are other cases, such as a missing trailing slash. Comparing the URL strings alone, I would still treat them as two different sites.

My question is: is there an efficient way to recognize them as one site? Thanks!

Kara
Stepin2
    Why not just check whether the two strings (site names) start with the same domain name, or whether one is a substring of the other? – Ajinkya Jan 07 '14 at 13:59
  • You could match content length – sanket Jan 07 '14 at 13:59
  • 3
    Those are 2 different URLs, not 2 different sites. – Raul Rene Jan 07 '14 at 14:00
  • @sanket they are different lengths, how would that help?!?! – M21B8 Jan 07 '14 at 14:00
  • @sanket - Content length won't be helpful here... – TheLostMind Jan 07 '14 at 14:00
  • @Karna Sometimes I'll take both the sites, for example: www.google.co.uk AND http://www.google.com/imghp – Stepin2 Jan 07 '14 at 14:00
  • I believe the content can change sometimes, so the length can change. – Stepin2 Jan 07 '14 at 14:03
  • Is this for checking a site or navigating to one? – Fraser Price Jan 07 '14 at 14:04
  • @FootStep: Then they can have different contents. – Ajinkya Jan 07 '14 at 14:05
  • I can't conceive of a general solution to this problem. You probably just need to guess the best you can. Even the solution that suggests finding the distance between the page contents is not fool-proof considering how many pages won't even produce the same content each time you visit them. – Cruncher Jan 07 '14 at 14:12
  • You don't know for sure they are the same, you can only guess they might be. I would download them both and check if they have the same contents. – Peter Lawrey Jan 07 '14 at 14:17
  • @TheLostMind I am talking about the content length when you make a URLConnection to the site and read the entire content of the page. – sanket Jan 07 '14 at 15:47
  • @sanket - That's not efficient... To check if 2 URLs are equal, we shouldn't have to check the contents of the page... We don't need the internet for this; all we need is some smart String comparison method... – TheLostMind Jan 08 '14 at 05:35

3 Answers

2

There's no foolproof way that I know of to do this.

Having said that, one approach could be to load the content from each URL, then apply the Levenshtein distance algorithm to all the pages that fall under the same domain name. You could then set a threshold for how "similar" the content must be before it's considered the same (if the content changes slightly, the bulk of it will likely still be identical.) Something like 10% of the page length could be a good starting point for that value.

This might be relatively slow depending on how many sites you have, but would take into account slight differences in content on each load that a simple hash or length calculation wouldn't.

To make this a bit more reliable you could check that certain things are identical (or not) across loads that you expect to be - for instance the title of the page.
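A minimal sketch of that idea (the textbook dynamic-programming Levenshtein distance and the 10% threshold below are just illustrative choices, not a tuned recipe):

```java
public class SimilarityCheck {

    // Standard two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Treat two pages as "the same" if the edit distance is within
    // some fraction (here 10%) of the longer page's length.
    static boolean samePage(String contentA, String contentB) {
        int longer = Math.max(contentA.length(), contentB.length());
        if (longer == 0) return true;
        return levenshtein(contentA, contentB) <= longer * 0.10;
    }

    public static void main(String[] args) {
        System.out.println(samePage("<html>hello world</html>",
                                    "<html>hello world!</html>")); // near-identical
        System.out.println(samePage("<html>hello world</html>",
                                    "<html>something else entirely</html>"));
    }
}
```

Note that a plain quadratic Levenshtein over whole pages gets expensive for large pages; comparing only pages under the same host first keeps the number of pairwise comparisons down.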

Michael Berry
  • Thanks. I have tried using the title of the page to do the job, but I don't know if it's sufficient. I had wished for an easier way, but since you say there is no foolproof one, I'll try the method you described. Thanks again! – Stepin2 Jan 07 '14 at 14:18
  • 1
    I would only diff visible content, because a lot of included page content comes from JavaScript libraries or blog templates. There are many ways to construct a page, but a simple scan of the content tags with the Levenshtein distance algorithm should do the trick. – Michael Shopsin Jan 07 '14 at 14:41
  • @MichaelShopsin Sure, you could improve on this method relatively easily by using a library that just took content that was actually displayed. – Michael Berry Jan 07 '14 at 14:47
  • @berry120 agreed, I've some background on web scraping and determining what is shown is not that hard. Web scraping is hard if you want to understand the content which is not what FootStep wants to do. – Michael Shopsin Jan 07 '14 at 17:59
1

Use simple string operations (or a regex) to parse out the domain names.

Example snippet:

String a = "http://www.google.com";
String tempString = a.substring(a.indexOf(".")+1, a.length()); // gets rid of everything before the first dot
String domainString = tempString.substring(0, tempString.indexOf(".")); // grabs everything before the second dot
System.out.println(domainString);

Outputs google

EDIT :

Here's a sample stand-alone demo that can deal with more complex domain structures and extract individual components.

You can add more domain test cases inside the main method in the source below; currently it tests the following ones:

http://www.google.com/

ftp://www.google.com

http://google.com/

google.com

localhost:80

Here's the source (Pardon my lazy spaghetti):

package domain.parser.test;

public class Parseromatic {

    public static void main(String[] args) {

        Parseromatic parser = new Parseromatic();
        parser.extract("http://www.google.com/");
        parser.extract("ftp://www.google.com");
        parser.extract("http://google.com/");
        parser.extract("google.com");
        parser.extract("localhost:80");

    }

    public void extract(String a){

        if(a.contains(".")){ // Initial outOfBounds proof check in cases like (http://localhost:80)
            String leadingString = a.substring(0, a.indexOf(".")); // First portion of the URL

            boolean hasProto = protocol(leadingString);

            // Now lets grab the rest
            String trailingString = a.substring(a.indexOf(".")+1, a.length());

            // Check if it contains a forward-slash
            if(trailingString.contains("/")){

                // We snip out everything before the slash

                String middleString = snipOffPages(trailingString);

                // Now we're only left with the domain related things

                // Check if subdomain was left in the leadingString

                if(middleString.contains(".")){
                    // Yep so lets deal with that

                    if(hasProto){ // If it had a protocol
                        System.out.println("Subdomain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Subdomain: "+leadingString);
                    }

                    // Now let's split up the rest

                    String[] split1 = middleString.split("\\.");

                    System.out.println("Domain: "+split1[0]);

                    // Check for port
                    if (split1[1].contains(":")){

                        // Assuming port is specified

                        String[] split2 = split1[1].split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+split1[1]);

                        System.out.println("Port: N/A");
                    }


                } else {

                    // No subdomain was present

                    System.out.println("Subdomain: N/A");

                    if(hasProto){ // If it had a protocol
                        System.out.println("Domain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Domain: "+leadingString);
                    }

                    // Check for port
                    if (middleString.contains(":")){

                        // Assuming port is specified

                        String[] split2 = middleString.split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+middleString);

                        System.out.println("Port: N/A");
                    }

                }


            } else { // We assume it only contains domain related things

                if(trailingString.contains(".")){
                    // Yep so lets deal with that

                    if(hasProto){ // If it had a protocol
                        System.out.println("Subdomain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Subdomain: "+leadingString);
                    }

                    // Now let's split up the rest

                    String[] split1 = trailingString.split("\\.");

                    System.out.println("Domain: "+split1[0]);

                    // Check for port
                    if (split1[1].contains(":")){

                        // Assuming port is specified

                        String[] split2 = split1[1].split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+split1[1]);

                        System.out.println("Port: N/A");
                    }


                } else {

                    // No subdomain was present

                    System.out.println("Subdomain: N/A");

                    if(hasProto){ // If it had a protocol
                        System.out.println("Domain: "+leadingString.substring(leadingString.indexOf("://")+3, leadingString.length()));
                    } else { // If it didn't have a protocol
                        System.out.println("Domain: "+leadingString);
                    }

                    // Check for port
                    if (trailingString.contains(":")){

                        // Assuming port is specified

                        String[] split2 = trailingString.split(":");

                        System.out.println("Top-Domain: "+split2[0]);

                        System.out.println("Port: "+split2[1]);

                    } else {

                        // Assuming no port specified

                        System.out.println("Top-Domain: "+trailingString);

                        System.out.println("Port: N/A");
                    }

                }

            }

        } else {

            // Assuming only one level exists

            boolean hasProto = protocol(a);

            // Check if protocol was present
            if(hasProto){
                String noProto = a.substring(a.indexOf("://")+3, a.length());

                // If some pages or something is specified
                if(noProto.contains("/")){
                    noProto = snipOffPages(noProto);
                }

                // Check for port
                if(noProto.contains(":")){

                    String[] split1 = noProto.split(":");

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+split1[0]);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: "+split1[1]);

                } else {

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+noProto);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: N/A");

                }

            } else {

                // If some pages or something is specified
                if(a.contains("/")){
                    a = snipOffPages(a);
                }

                // Check for port
                if(a.contains(":")){

                    String[] split1 = a.split(":");

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+split1[0]);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: "+split1[1]);

                } else {

                    System.out.println("Subdomain: N/A");
                    System.out.println("Domain: "+a);
                    System.out.println("Top-Domain: N/A");
                    System.out.println("Port: N/A");

                }

            }



        }

        System.out.println(); // Cosmetic empty line, can ignore


    }

    public String snipOffPages(String a){
        return a.substring(0,a.indexOf("/"));
    }

    public boolean protocol(String a) {
        // Protocol extraction
        if(a.contains("://")){ // Check for existence of protocol declaration
            String protocolString = a.substring(0, a.indexOf("://"));
            System.out.println("Protocol: "+protocolString);
            return true;
        }
        else {
            System.out.println("Protocol: N/A");
            return false;
        }
    }

}

And for the specified domains above it outputs:

Protocol: http
Subdomain: www
Domain: google
Top-Domain: com
Port: N/A

Protocol: ftp
Subdomain: www
Domain: google
Top-Domain: com
Port: N/A

Protocol: http
Subdomain: N/A
Domain: google
Top-Domain: com
Port: N/A

Protocol: N/A
Subdomain: N/A
Domain: google
Top-Domain: com
Port: N/A

Protocol: N/A
Subdomain: N/A
Domain: localhost
Top-Domain: N/A
Port: 80
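As an aside, much of the hand-rolled parsing above can be delegated to java.net.URI. Here's a minimal sketch; note that splitting the host into subdomain/domain/top-domain by dots is a naive assumption of mine that breaks on multi-part TLDs such as .co.uk (a public-suffix list would be needed for those):

```java
import java.net.URI;

public class UriParts {

    // Parse a URL with java.net.URI and naively split the host into
    // subdomain / domain / top-level domain. Multi-part TLDs (e.g. .co.uk)
    // are NOT handled correctly by this simple dot-split.
    static String extract(String url) throws Exception {
        URI uri = new URI(url);
        StringBuilder sb = new StringBuilder();
        sb.append("Protocol: ")
          .append(uri.getScheme() == null ? "N/A" : uri.getScheme()).append('\n');
        String host = uri.getHost();
        if (host == null) return sb.append("Host: N/A").toString();
        String[] parts = host.split("\\.");
        sb.append("Subdomain: ").append(parts.length > 2 ? parts[0] : "N/A").append('\n');
        sb.append("Domain: ").append(parts[Math.max(0, parts.length - 2)]).append('\n');
        sb.append("Top-Domain: ")
          .append(parts.length > 1 ? parts[parts.length - 1] : "N/A").append('\n');
        sb.append("Port: ")
          .append(uri.getPort() == -1 ? "N/A" : String.valueOf(uri.getPort()));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("http://www.google.com/"));
        System.out.println();
        System.out.println(extract("ftp://www.google.com"));
    }
}
```

One caveat: inputs without a scheme (e.g. "google.com" or "localhost:80") don't parse as expected with URI — the host comes back null or the leading token is taken as the scheme — which is exactly the gap the hand-rolled parser above tries to fill.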
Ceiling Gecko
  • Thanks, but I don't think it's general enough. There can be more complex forms for a site with more than a three-level domain. – Stepin2 Jan 07 '14 at 14:23
  • You could just amp up the regex, give me a moment and I'll throw together a more complex solution. – Ceiling Gecko Jan 07 '14 at 14:49
  • There, you could use something similar to the demo above. – Ceiling Gecko Jan 08 '14 at 10:30
  • What I'm doing is not that clearly defined: sometimes I'll still need several pages from one site and treat them as separate sites, like Google News and Google Scholar. But I believe it will be helpful for other uses, thanks anyway! – Stepin2 Jan 08 '14 at 12:01
0

The best way is probably to use regular expressions to extract the domain name and keep a list of all the domain names you have seen. Whenever you check a new URL, check it against your list of 'visited' domain names too. Here is an older question about how to get the domain name:

Get domain name from given url
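Along those lines, a minimal sketch of keeping a set of visited hosts, using java.net.URI for the extraction (the "strip a leading www." step is my own heuristic, and this assumes every stored URL includes a scheme like http://):

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

public class HostDedup {

    // Normalize a URL to its host, lower-cased, with a leading "www." stripped
    // (a heuristic, not a full equivalence rule). Returns the input unchanged
    // if it cannot be parsed as a URI with a host.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) return url;
            host = host.toLowerCase();
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (Exception e) {
            return url;
        }
    }

    public static void main(String[] args) {
        String[] crawled = {
            "http://www.hao123.com/index.htm",
            "http://www.hao123.com",
            "http://hao123.com/"
        };
        Set<String> seen = new LinkedHashSet<>();
        for (String url : crawled) seen.add(hostOf(url));
        System.out.println(seen); // all three collapse to [hao123.com]
    }
}
```

All three example URLs from the question then map to the same key, so a simple Set membership test is enough to skip the duplicates.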

Diana Amza