
I am stuck on a rather simple problem: removing duplicate domains from a list of URLs, using JavaScript.

Here's what I am currently doing: I have an array called 'list' which holds the URLs. I work on that to extract the domains and put them in a new array called 'domain'.

Then I use two for loops to go through the entire list and check for duplicate domains. If the domains match, I splice the duplicate one out. But it seems to remove too many entries, and I am pretty sure I am doing something wrong. Can somebody tell me what I am doing wrong, or suggest a simpler/better way of doing it?

for (var i = 0; i < list.length; i++) {

    for (var j = i + 1; j < list.length; j++) {

        if (domain[i] == domain[j]) {

            console.log('REMOVING:');
            console.log(i + '. ' + list[i]);
            console.log(j + '. ' + list[j]);
            console.log(domain[i]);
            console.log(domain[j]);

            list.splice(j, 1);

        }
    }
}

This is not a 'how to remove duplicates from an array' question. I have a list of URLs, and need to check for - and remove - only the duplicate 'domains'. So suppose I have 4 URLs from youtube: I need to keep only the first one and remove the rest.

user3001859
    Possible duplicate of [Unique values in an array](http://stackoverflow.com/questions/1960473/unique-values-in-an-array) – koffeinfrei Aug 04 '16 at 19:05
  • hi koffeinfrei, my question is a little different - as I have two arrays - one with the URL list and one with the domains. I need to check for duplicates in the domain list, and remove them from the URL list. – user3001859 Aug 04 '16 at 19:11
  • And never modify the list you are iterating over :) – SirCapy Aug 04 '16 at 19:31
  • I think I found the error - I was splicing only one of the arrays, not both. I was splicing the URL array, but not the domain array. But I would like to leave the question open in case somebody has a more 'elegant' solution to this problem - removing duplicate domains from a list of URLs (using JS) – user3001859 Aug 04 '16 at 19:56

6 Answers


ES5: filter the array, keeping an element only if its index equals the index of its first occurrence:

var uniqueDomains = list.filter(function(elem, pos, arr) {
    return arr.indexOf(elem) === pos;
});

ES6: use a Set

const uniqueDomains = [ ...new Set(list) ];

or if you can't use the spread operator:

const uniqueDomains = Array.from(new Set(list));
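Since the question asks for uniqueness by domain rather than by full URL, the same idea can key on the domain with a Map. A sketch, where `getDomain` is a naive stand-in for whatever URL parser you use:

```javascript
// getDomain is a naive placeholder: "http://host/path" -> "host"
function getDomain(url) {
  return url.split('/')[2];
}

// Keep the first URL seen for each domain.
function uniqueByDomain(urls) {
  const seen = new Map();
  for (const url of urls) {
    const domain = getDomain(url);
    if (!seen.has(domain)) {
      seen.set(domain, url);
    }
  }
  return [...seen.values()];
}
```

For example, `uniqueByDomain(['http://a.com/1.html', 'http://a.com/2.html', 'http://b.com/1.html'])` keeps only the first a.com entry.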
Rob M.

Try to get rid of the domains array. Instead build a map of already "used" domains:

var urls = [
  'http://example.org/page-1.html',
  'http://example.org/page-2.html',
  'http://google.com/search.html',
  'http://mozilla.com/foo.html',
];

var domains = {};
var uniqueUrls = urls.filter(function(url) {
  // whatever function you're using to parse URLs
  var domain = extractDomain(url);
  if (domains[domain]) {
    // we have seen this domain before, so ignore the URL
    return false;
  }
  // mark domain, retain URL
  domains[domain] = true;
  return true;
});

console.log(uniqueUrls);
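The `extractDomain` helper is left open above; one possible implementation, assuming a modern environment where the standard `URL` constructor is available:

```javascript
// Parse the URL with the built-in URL constructor and return
// its hostname, e.g. "http://example.org/p.html" -> "example.org".
function extractDomain(url) {
  return new URL(url).hostname;
}
```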
rodneyrehm

You can let an object handle the checking for you.

var a = [];

a.push('http://test');
a.push('http://that');
a.push('http://that');
a.push('http://that');

var o = {};

for (var ii = 0; ii < a.length; ii++) {
    o[a[ii]] = true; // use the URL as a key; duplicates collapse
}

var nA = [];

for (var k in o) {
    nA.push(k);
}
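Since the question needs uniqueness per domain rather than per URL, the same object trick can key on the domain and keep the first full URL as the value. A sketch, with naive domain extraction as a placeholder for a real parser:

```javascript
var urls = [
  'http://test.com/a.html',
  'http://that.com/b.html',
  'http://that.com/c.html'
];

var byDomain = {};
for (var i = 0; i < urls.length; i++) {
  var domain = urls[i].split('/')[2]; // naive: "http://host/path" -> "host"
  if (!(domain in byDomain)) {
    byDomain[domain] = urls[i]; // first URL for this domain wins
  }
}

var uniqueUrls = [];
for (var key in byDomain) {
  uniqueUrls.push(byDomain[key]);
}
```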
kemiller2002

If you are able to use the Underscore.js library, it's as simple as

yourArray = _.uniq(yourArray);

http://underscorejs.org/#uniq
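`_.uniq` also accepts an iteratee, so the deduplication can key on the domain instead of the full URL. A sketch; the shim below only stands in for Underscore so the snippet runs on its own, and the domain parsing is a naive placeholder:

```javascript
// Minimal stand-in for Underscore's _.uniq(array, isSorted, iteratee),
// included only so this snippet runs without the library.
var _ = {
  uniq: function (array, isSorted, iteratee) {
    var seen = {}, result = [];
    for (var i = 0; i < array.length; i++) {
      var key = iteratee ? iteratee(array[i]) : array[i];
      if (!(key in seen)) {
        seen[key] = true;
        result.push(array[i]);
      }
    }
    return result;
  }
};

var urls = ['http://foo.com/a.html', 'http://foo.com/b.html', 'http://bar.com/c.html'];
var uniqueUrls = _.uniq(urls, false, function (url) {
  return url.split('/')[2]; // naive domain extraction
});
```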

War10ck
Kalman

The best way to remove duplicates is to use a map. The example has an array of URIs with some duplicates. First insert the strings into an object, then iterate over the object to create an array. Boom, no duplicates.

function getHostName(url) {
    var match = url.match(/:\/\/(www[0-9]?\.)?(.[^/:]+)/i);
    if (match != null && match.length > 2 && typeof match[2] === 'string' && match[2].length > 0) {
        return match[2];
    }
    else {
        return null;
    }
}

var uris = ["http://foo.org/barbar", "http://www.bar.com/foo/bar/bar.html", "http://foo.bar/lorem/", "http://foo.org", "https://bar.bar", "http://foo.org", "http://bar.bar"];
var urisObj = {};
for (var i = 0; i < uris.length; i++) {
    urisObj[getHostName(uris[i])] = getHostName(uris[i]);
}

uris = Object.keys(urisObj).map(function(x) { return urisObj[x]; });

console.log(uris);

Edit:

Using http://www.primaryobjects.com/2012/11/19/parsing-hostname-and-domain-from-a-url-with-javascript/ to get the host name from a string.
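As written, the answer's final array holds only hostnames, not the original URLs. To keep the first full URL per host instead, store the URL as the value; a sketch (repeating the answer's getHostName so the snippet runs standalone):

```javascript
// Same getHostName as in the answer above.
function getHostName(url) {
  var match = url.match(/:\/\/(www[0-9]?\.)?(.[^/:]+)/i);
  if (match != null && match.length > 2 && typeof match[2] === 'string' && match[2].length > 0) {
    return match[2];
  }
  return null;
}

var uris = ['http://foo.org/barbar', 'http://www.bar.com/foo/bar/bar.html', 'http://foo.org', 'http://bar.bar'];

var firstByHost = {};
for (var i = 0; i < uris.length; i++) {
  var host = getHostName(uris[i]);
  if (!(host in firstByHost)) {
    firstByHost[host] = uris[i]; // keep the first URL seen for this host
  }
}

var uniqueUris = Object.keys(firstByHost).map(function (host) {
  return firstByHost[host];
});
```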

Nielsvh
  • The URLs in my list don't just include the domains, but pages too, like foo.com/page1.html, bar.com/whatever/p1.html. I need to check for duplicates in domains alone. [Right now I have two arrays, one with full URLs and one with domains alone] – user3001859 Aug 04 '16 at 19:17

If you want to do it your original way (or something very similar to it), iterate down the array (with i--) instead of up (with i++), as in the following code:

var list = ["abc", "cba", "abc", "abc", "abc", "abc"];

for (var i = list.length - 1; i >= 0; i--) {

  for (var j = i-1; j >= 0; j--) {

    if (list[i] == list[j]) {

        console.log('REMOVING:');
        console.log(i + '. ' + list[i]);
        console.log(j + '. ' + list[j]);
        console.log(list[i]);
        console.log(list[j]);

        list.splice(i, 1);

    }
  }
}

console.log(list);
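Applied to the original two-array setup, the same downward loop works as long as both arrays are spliced together so they stay in sync (a sketch with made-up example data):

```javascript
var list = ['http://a.com/1.html', 'http://b.com/1.html', 'http://a.com/2.html', 'http://a.com/3.html'];
var domain = ['a.com', 'b.com', 'a.com', 'a.com'];

for (var i = list.length - 1; i >= 0; i--) {
  for (var j = i - 1; j >= 0; j--) {
    if (domain[i] == domain[j]) {
      // An earlier entry has the same domain, so remove index i
      // from BOTH arrays to keep them in sync.
      list.splice(i, 1);
      domain.splice(i, 1);
      break; // index i is gone; move on to the next i
    }
  }
}
```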
Kalman
  • In general, removing elements from an array and going up at the same time is a horrible idea. Think about it. :) – Kalman Aug 04 '16 at 19:25