
I want to access a web server using HttpWebRequest and fetch thousands of records from a given range of pages. Each request to a page fetches 15 records, and there are roughly 8,000 to 10,000 pages on the web server. That means at least 8,000 requests to the server and about 120,000 records in total! Done naively in a single thread, the task can be very time-consuming, so multithreading is the immediate solution that comes to mind.

Currently, I have created a worker class for searching. That worker class spawns 5 subworkers (threads), each of which searches part of a given range. However, I am new to threading and cannot make it work: I am having trouble synchronizing the threads and getting them to work together. I know about delegates, actions, and events in .NET, but making them work with threads is getting confusing. This is the code that I am using:

public void Start()
{
    this.totalRangePerThread = ((this.endRange - this.startRange) / this.subWorkerThreads.Length);
    for (int i = 0; i < this.subWorkerThreads.Length; ++i)
    {
        //theThreads[counter] = new Thread(new ThreadStart(MethodName));
        this.subWorkerThreads[i] = new Thread(() => searchItem(this.startRange, this.totalRangePerThread));
        //this.subWorkerThreads[i].Start();
        this.startRange = this.startRange + this.totalRangePerThread;
    }

    for (int threadIndex = 0; threadIndex < this.subWorkerThreads.Length; ++threadIndex)
        this.subWorkerThreads[threadIndex].Start();
}

The searchItem method:

public void searchItem(int start, int pagesToSearchPerThread)
{
    for (int count = 0; count < pagesToSearchPerThread; ++count)
    {
     //searching routine here
    }
}

The problem lies in the variables shared between the threads. Can anyone guide me on how to make this procedure thread-safe?

faizanjehangir
  • I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". – John Saunders Mar 22 '13 at 18:46
  • @JohnSaunders will be careful next time around..thanks – faizanjehangir Mar 22 '13 at 18:49
  • @faizanjehangir - Have you considered using a database that provides FTS features? This is really susceptible to DDOS if you're doing it manually. – beatgammit Mar 22 '13 at 18:53
  • If you don't control that web server (i.e. it belongs to somebody else), they're likely to block you for hitting them too fast. They'll think you're doing a DDOS attack. And if you do control the server (and thus the data), can't you get the information you need another way? (Such as a data dump, etc.) – Jim Mischel Mar 22 '13 at 19:33
  • possible duplicate of [Optimizing download of multiple web pages. C#](http://stackoverflow.com/questions/6062079/optimizing-download-of-multiple-web-pages-c-sharp) – Peter Ritchie Mar 22 '13 at 19:35

2 Answers


The first answer is that these threads don't really need to share much state (assuming I'm understanding you correctly). They have some shared input variables (the target web server, for example), but those are thread-safe because they aren't being changed. I assume the plan is that they'll build a database or some such containing the records they retrieve. You should be fine to have each of the five threads fill its own result collection, and then merge the collections in a single thread once all the subworker threads are done. If the architecture you're using to store the data somehow makes that expensive, then how much you're planning to store and what you're planning to store it in becomes pertinent. Perhaps you could share what those are?
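As a minimal sketch of the "collect separately, merge afterwards" idea, the following assumes the records are plain strings and uses a hypothetical `FetchPage` helper in place of the real HttpWebRequest routine. The class name, method names, and the exclusive `endPage` convention are illustrative assumptions, not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class PerThreadResultsSketch
{
    // Hypothetical stand-in for the real search routine:
    // returns the "records" found on a single page.
    static List<string> FetchPage(int page)
    {
        return new List<string> { "record-from-page-" + page };
    }

    // Searches pages in [startPage, endPage) using workerCount threads.
    public static List<string> Search(int startPage, int endPage, int workerCount)
    {
        var perThreadResults = new List<string>[workerCount];
        var threads = new Thread[workerCount];
        int pagesPerThread = (endPage - startPage) / workerCount;

        for (int i = 0; i < workerCount; ++i)
        {
            // Fresh locals per iteration, so each lambda captures its own values.
            int index = i;
            int first = startPage + i * pagesPerThread;
            // The last worker also takes the remainder left by integer division.
            int last = (i == workerCount - 1) ? endPage : first + pagesPerThread;

            threads[i] = new Thread(() =>
            {
                var local = new List<string>();
                for (int page = first; page < last; ++page)
                    local.AddRange(FetchPage(page));
                // Each thread writes only to its own slot, so no lock is needed.
                perThreadResults[index] = local;
            });
        }

        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join(); // wait until every worker is done

        // Merge on the calling thread, after all workers have finished.
        var merged = new List<string>();
        foreach (var part in perThreadResults) merged.AddRange(part);
        return merged;
    }

    public static void Main()
    {
        var all = Search(0, 23, 5);
        Console.WriteLine(all.Count); // prints 23: one fake record per page
    }
}
```

Because no collection is touched by more than one thread while the workers run, the only synchronization point is the `Join` calls before the merge.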

Ben Barden
  • What I mean by a variable being shared is that the variable `start` ends up holding the value meant for the last thread; it overwrites the previous ones... – faizanjehangir Mar 22 '13 at 18:59
  • So don't overwrite it. Have each of the threads create a new local variable, and leave the input parameter unchanged. I personally consider leaving input parameters alone by default to be good coding practice in general. – Ben Barden Mar 22 '13 at 19:01
  • But won't the spawned threads share the same function `searchItem` every time they start, overwriting or using each other's variables? – faizanjehangir Mar 22 '13 at 19:07
  • If you define the local variable inside the `searchItem` function, then it's scoped to that execution of the function. The `searchItem` function is indeed being executed by all of the threads, but each execution is thread-specific, as if you were running it in series rather than in parallel. – Ben Barden Mar 22 '13 at 19:12

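The real problem you're facing is that the lambda expression passed to the Thread constructor captures the outer variable (startRange) itself, not the value it had when the lambda was created. Because the threads only start after the loop has finished updating startRange, every thread reads the final value. One way to fix it is to make a copy of the variable, like this: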

for (int i = 0; i < this.subWorkerThreads.Length; ++i)
{
    var copy = startRange;
    this.subWorkerThreads[i] = new Thread(() => searchItem(copy, this.totalRangePerThread));
    this.startRange = this.startRange + this.totalRangePerThread;
}

For more information on creating and starting threads, see Joe Albahari's excellent ebook (there's also a section on captured variables a bit further down). If you want to learn about closures, see this question.
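The difference between capturing the variable and capturing a copy can be seen without any threads at all. The sketch below is an illustration under assumed names (`CaptureSketch`, `CaptureShared`, `CaptureCopy`), using a local `start` in the role of `startRange` and plain `Action` delegates that run after the loop, in the same way the question starts its threads only after the first loop has finished.

```csharp
using System;
using System.Collections.Generic;

public static class CaptureSketch
{
    // Mirrors the question's code: every lambda captures the same variable.
    public static List<int> CaptureShared(int workers, int step)
    {
        int start = 0; // plays the role of startRange
        var seen = new List<int>();
        var actions = new List<Action>();
        for (int i = 0; i < workers; ++i)
        {
            // Captures the variable itself, not its current value.
            actions.Add(() => seen.Add(start));
            start += step;
        }
        // Run after the loop, when 'start' already holds its final value.
        foreach (var a in actions) a();
        return seen;
    }

    // Mirrors the fix: each lambda captures its own per-iteration copy.
    public static List<int> CaptureCopy(int workers, int step)
    {
        int start = 0;
        var seen = new List<int>();
        var actions = new List<Action>();
        for (int i = 0; i < workers; ++i)
        {
            int copy = start; // a new variable on every iteration
            actions.Add(() => seen.Add(copy));
            start += step;
        }
        foreach (var a in actions) a();
        return seen;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CaptureShared(3, 10))); // prints 30, 30, 30
        Console.WriteLine(string.Join(", ", CaptureCopy(3, 10)));   // prints 0, 10, 20
    }
}
```

With real threads the shared version behaves the same way whenever the threads run after the loop completes; if they start earlier, the value each one sees depends on timing, which is arguably worse.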

vlad