4

I have a bunch of txt files that contain 300k lines each. Each line has a URL, e.g. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718

In a string[] array I have a list of websites:

amazon.com
google.com
ieee.org
...

I need to check whether each URL contains one of these websites and update the counter that corresponds to that website.

For now I'm using the Contains method, but it is very slow. There are ~900 records in the array, so the worst case is 900 * 300K comparisons (for one file). I believe IndexOf would be just as slow.

Can someone help me with a faster approach? Thank you in advance.

Sparrow_ua
  • Show us your current code. – Selman Genç Feb 25 '14 at 16:43
  • This is an easy candidate for parallelization - have you looked into Parallel.For or similar? – RB. Feb 25 '14 at 16:48
  • Also, are you only going to be searching for the hostname? If so, there is a way to speed it up. – RB. Feb 25 '14 at 16:53
  • Testing URIs with `contains` is a fundamentally broken idea anyway. What about `google.com.example.com`? It looks like you should really parse all the URLs (really URIs, though, right?), extract the relevant part, and look it up in a dictionary. – harold Feb 25 '14 at 16:54

6 Answers

3

A good solution would leverage hashing. My approach would be the following (see the sketch after the list):

  1. Hash all your known hosts (the string[] collection that you mention)
  2. Store the hashes in a List<int> (e.g. hashes.Add("www.ieee.com".GetHashCode()))
  3. Sort the list (hashes.Sort())
  4. When looking up a url:
    1. Parse the host name out of the url (get the host from http://www.ieee.com/...). You can use new Uri("http://www.ieee.com/...").Host to get www.ieee.com.
    2. Preprocess it so the case is always the same; use lower case (if you have http://www.IEee.COM/, take www.ieee.com).
    3. Hash parsed host name, and look for it in the hashes list. Use BinarySearch method to find the hash.
    4. If the hash exists, then you have this host in your list
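
A minimal sketch of those steps, assuming the stored host list and the parsed Host values use the same form (e.g. both keep the leading www.); the class and method names are illustrative:

class HostHashLookup
{
    private readonly List<int> hashes = new List<int>();

    public HostHashLookup(IEnumerable<string> knownHosts)
    {
        // Steps 1-3: hash every known host, store the hashes, and sort them
        foreach (var host in knownHosts)
            hashes.Add(host.ToLowerInvariant().GetHashCode());
        hashes.Sort();
    }

    public bool ContainsHostOf(string url)
    {
        // Steps 4.1-4.2: parse the host out of the URL and normalize its case
        var host = new Uri(url).Host.ToLowerInvariant();
        // Steps 4.3-4.4: hash it and binary search the sorted list
        return hashes.BinarySearch(host.GetHashCode()) >= 0;
    }
}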

An even faster and more memory-efficient way is to use Bloom filters. I suggest you read about them on Wikipedia, and there's even a C# implementation of a Bloom filter on CodePlex. Of course, you need to take into account that a Bloom filter allows false positive results (it can tell you that a value is in a collection even though it's not), so it should be used as an optimization only. It will never tell you that a value is absent when it is actually present, i.e. there are no false negatives.
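
If you want to experiment with the idea without pulling in the CodePlex implementation, a toy Bloom filter might look something like this (the bit-array size and the way the k hash functions are derived are arbitrary assumptions, not tuned values):

// Toy Bloom filter for illustration only; a real implementation would size the
// bit array and choose the number of hash functions from the expected item
// count and target false-positive rate.
class SimpleBloomFilter
{
    private readonly bool[] bits;
    private readonly int hashCount;

    public SimpleBloomFilter(int size, int hashCount)
    {
        bits = new bool[size];
        this.hashCount = hashCount;
    }

    // Derive the i-th index by mixing the string's hash code with a seed.
    private int Index(string item, int seed)
    {
        int h = unchecked(item.GetHashCode() ^ (seed * 0x5bd1e995));
        return Math.Abs(h % bits.Length);
    }

    public void Add(string item)
    {
        for (int i = 1; i <= hashCount; i++)
            bits[Index(item, i)] = true;
    }

    public bool MightContain(string item)
    {
        for (int i = 1; i <= hashCount; i++)
            if (!bits[Index(item, i)]) return false;
        return true; // may still be a false positive
    }
}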


Using a Dictionary<TKey, TValue> is also an option, but if you only need to count the number of occurrences, it can be more efficient to maintain the collection of hashes yourself.

Nikola Radosavljević
  • "more efficient to maintain collection of hashes yourself." Doubtful. There is a non-zero chance of hash collision, so your custom collection of hashes will need some way to resolve collisions. I'm sure you could do better than `Dictionary`, but you're going to spend a lot of time coding it up. – Jim Mischel Feb 25 '14 at 20:18
  • Also, Bloom filter isn't a particularly good choice because all it does is tell you if an item exists. It doesn't keep a count, so you'll have to maintain a separate count somewhere else, indexed by host name. Using a Bloom filter seems like a whole lot of extra work for no gain. – Jim Mischel Feb 25 '14 at 20:19
  • I'm not sure all that is correct. I haven't done an experiment and therefore can't claim it, but I really believe there is zero chance of hash collision for a data set consisting of alphanumeric strings no longer than 50 bytes. As for keeping count, I don't see a problem with storing the counts in a separate `List` with the same number of elements as the hash list and corresponding element indices between the two lists. This IS much more complex than using `Dictionary`; I was trying to provide an answer about the most efficient approach I could think of in a few moments. – Nikola Radosavljević Feb 26 '14 at 11:02
  • There are only 2^32 (4 billion and change) 32-bit hash codes. There are over 8 billion possible 7-character strings using just the letters A-Z. So your first assertion is wrong: the chance of hash collision is quite high. – Jim Mischel Feb 26 '14 at 12:51
  • You're right. Thanks for correcting me. Unfortunately that would lead to implementing manual collision resolution, which is, I agree, too much work for a simple feat like this (unless one is desperate for a minor performance gain). – Nikola Radosavljević Feb 26 '14 at 13:24
1

Create a Dictionary of domain to counter.

For each URL, extract the domain (I'll leave that part to you to figure out), then look up the domain in the Dictionary and increment the counter.
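
A minimal sketch of that, assuming each line is a well-formed absolute URL and the array (here called sites, reading from a hypothetical path) holds bare domains like ieee.org; the www. stripping is a naive illustration:

var counts = new Dictionary<string, int>();
foreach (var site in sites) counts[site] = 0;   // sites is your string[] of websites

foreach (var line in File.ReadLines(path))
{
    var domain = new Uri(line).Host;            // e.g. "www.ieee.org"
    if (domain.StartsWith("www.")) domain = domain.Substring(4);

    int n;
    if (counts.TryGetValue(domain, out n)) counts[domain] = n + 1;
}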


I assume we're talking about domains since this is what you showed in your array as examples. If this can be any part of the URL instead, storing all your strings in a trie-like structure could work.

Bernhard Barker
0

You can read this question; the answers will help you:

High performance "contains" search in list of strings in C#

Only a Curious Mind
  • But he doesn't know the full string - he is trying to find strings which contain a substring. – RB. Feb 25 '14 at 16:51
0

Well, in a sort of similar situation, though with IndexOf, I achieved a huge performance improvement with a simple loop,

as in something like

int l = url.Length;
int position = 0;
while (position < l)
{
   // only do the full comparison when the first character matches
   if (url[position] == website[0])
   {
      // test the rest of the website string from this position in another loop
      if (exactMatch(url, position, website))
      {
         // match found - update the counter for this website
      }
   }
   position++;
}

It seems a bit crude, but in an extreme case - searching for a set of about 10 strings in a large structured (1.2 MB) file, so regex was out - I went from 3 minutes to under 1 second.
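
For what it's worth, a hypothetical exactMatch helper matching the call above might look roughly like this (not part of the original answer):

static bool exactMatch(string url, int position, string website)
{
   // not enough characters left in the url to hold the website string
   if (position + website.Length > url.Length) return false;

   for (int i = 0; i < website.Length; i++)
   {
      if (url[position + i] != website[i]) return false;
   }
   return true;
}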

Tony Hopkinson
0

Your problem, as you describe it, should not involve searching for substrings at all. Split your source file into lines (or read it line by line), each of which you already know contains a URL, run each line through some function to extract the domain name, then look that up in some fast-access tally of your target domains, such as a Dictionary<string, int>, incrementing as you go, e.g.:

var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
Func<string, string> naiveDomainExtractor = x => x.Split('/')[0];
foreach (var line in source)
{
    var domain = naiveDomainExtractor(line);
    if (tally.ContainsKey(domain)) tally[domain]++;
}

...which takes a third of a second on my not particularly speedy machine, including generation of test data.

Admittedly your domain extractor may be a bit more sophisticated, but it will probably not be very processor intensive, and if you've got multiple cores at your disposal you can speed things up further by using a ConcurrentDictionary<string, int> and Parallel.ForEach.
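
A rough sketch of that parallel variant, reusing tally, source and naiveDomainExtractor from the snippet above (ConcurrentDictionary lives in System.Collections.Concurrent and Parallel in System.Threading.Tasks):

var concurrentTally = new ConcurrentDictionary<string, int>(tally);

Parallel.ForEach(source, line =>
{
    var domain = naiveDomainExtractor(line);
    if (concurrentTally.ContainsKey(domain))
        concurrentTally.AddOrUpdate(domain, 1, (key, count) => count + 1);
});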

stovroz
0

You'd have to test the performance, but you might try converting the URLs to actual System.Uri objects.

Store the list of websites as a HashSet<string> - then use the HashSet to look up the Uri's Host:

IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));
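
Since the question ultimately wants per-site counters, the matches could then be grouped by host - an illustrative continuation, assuming the hosts in the file match the HashSet entries exactly (e.g. no extra www. prefix):

Dictionary<string, int> counts = matches
    .GroupBy(e => e.Host)
    .ToDictionary(g => g.Key, g => g.Count());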
McAden