
I wrote a program that spawns a user-chosen number of threads, each of which crawls the internet in search of some data; you might call it a web crawler.

The bottleneck of the program should definitely be network capacity, since any given thread spends the majority of its time waiting on network requests:

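// requires: using System.Net;
WebClient client = new WebClient();
string url = "http://averynice.web.api?x=2d2d2&?y=dwdwdw";
// Blocks the calling thread until the whole response has been downloaded.
string response = client.DownloadString(url);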

The problem I am experiencing is that the program reaches its peak speed (in terms of how many web pages it has processed) when I make it spawn about 20 threads, that speed being about 1,000 pages per minute. With any more threads than that, its speed drops as I add threads.

On the other hand, if I launch 10 or even 20 separate instances of the program and spawn 20 threads in each, every instance reaches the same top speed, resulting in a cumulative speed of 1,000 pages per minute * the number of program instances running.

I read here on Stack Overflow that:

Both processes and threads are independent sequences of execution. The typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces.

So I figure the problem is in the size of the shared memory space, but how do I change that so that I could have a single instance running as many threads as my network capacity will handle?

If the problem isn't shared memory space then what is the limiting factor/bottleneck and how might I work around it?

Thanks in advance for any help or suggestions :).

Pi_
  • There is a default limit of 2 simultaneous connections. The answers to this question describe how to change that: http://stackoverflow.com/q/866350/517852 – Mike Zboray Jun 23 '13 at 22:48
  • The shared memory space mentioned is actually the process virtual address space that is being shared among its threads. Its size is 2 GiB on 32-bit Windows versions (or 3 GiB if the special 3/1 tuning option is in effect) and much more on 64-bit systems. – Hristo Iliev Jun 24 '13 at 08:59
  • @mikez This is absolutely beautiful. Please submit an answer to my question and I will un-accept the current one in favor of yours which solved my problem with a single line of code! – Pi_ Jul 02 '13 at 13:14

1 Answer

All WebClient instances (in the same AppDomain) are limited to 2 active connections per host by default. You can change this programmatically by setting the System.Net.ServicePointManager.DefaultConnectionLimit property. This can also be configured using app.config. This question shows several options for changing the limit. Just make sure the web API doesn't block you for making too many requests!
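As a minimal sketch of the programmatic option, assuming .NET Framework and a limit of 100 (the class name, the constant and the value 100 are hypothetical, chosen only for illustration), the limit can be raised once at startup, before any request is made:

```csharp
using System;
using System.Net;

public class ConnectionLimitSketch
{
    // Hypothetical value; pick one that the target server tolerates.
    private const int MaxConnectionsPerHost = 100;

    public static void Main()
    {
        // Set this before the first request: ServicePoint objects that
        // already exist keep the limit they were created with.
        ServicePointManager.DefaultConnectionLimit = MaxConnectionsPerHost;

        // Print the value now in effect for newly created ServicePoints.
        Console.WriteLine(ServicePointManager.DefaultConnectionLimit);
    }
}
```

Worker threads started after this line can then create their WebClient instances as before; nothing else in the crawling code needs to change.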

Mike Zboray
  • Thank you very much, my program now runs at full speed with a single process running about 300 threads, at which point the CPU becomes the bottleneck when it reaches about 80% load. That or network capacity, hard to tell but not important. – Pi_ Jul 02 '13 at 15:20