I'm making a program that is a kind of web crawler. It downloads the HTML of a page, parses it for a specific piece of text using a regex, and adds the match to a list.

To achieve this, I used async HTTP requests. The GET request is sent asynchronously and the parsing operation is performed on the returned HTML.

My issue, which may turn out to be simple, is that the program doesn't run smoothly. It sends a bunch of requests, pauses for a couple of seconds, then increments the count of parsed items all at once (although the counter is programmed to increment once each time an item is added), so that, for example, it jumps from 53 to 69 instead of showing 54, 55, 56, ...

Sorry for being a newb, but I taught myself all this stuff and some experienced advice would go a long way.

Thanks

blizz
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – SLaks May 17 '12 at 03:14
  • This is for a specific site where the resulting HTML always has the same form and only the values change, so regex works fine. – blizz May 17 '12 at 03:57
  • But just out of curiosity, is there a more efficient way of doing it? – blizz May 17 '12 at 03:58

1 Answer

That behavior sounds correct.

The slowest part of your task is downloading the pages over the network.

Your program starts downloading a bunch of pages at once, waits for them to arrive, then parses them all almost instantly.
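
To make that concrete, here is a minimal sketch that reproduces the burst without touching the network. Everything in it is an assumption for illustration: `FakeDownloadAsync` is a hypothetical stand-in that just waits about a second, and the "parsing" step only bumps a counter.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class BurstDemo
{
    // Hypothetical stand-in for a network download: about one second of latency.
    static async Task<string> FakeDownloadAsync(int id)
    {
        await Task.Delay(1000);
        return "<html>item " + id + "</html>";
    }

    // Starts 'count' downloads at once and "parses" each page when it arrives.
    // Returns the final value of the counter.
    public static async Task<int> RunAsync(int count)
    {
        int parsed = 0;
        var clock = Stopwatch.StartNew();

        var tasks = Enumerable.Range(0, count).Select(async id =>
        {
            string html = await FakeDownloadAsync(id);
            // The counter is incremented once per item, but every download
            // finishes at roughly the same moment, so the increments bunch up.
            int n = Interlocked.Increment(ref parsed);
            Console.WriteLine("{0} ms: parsed item {1} ({2} chars)",
                clock.ElapsedMilliseconds, n, html.Length);
        }).ToArray();

        await Task.WhenAll(tasks);
        return parsed;
    }

    static void Main()
    {
        RunAsync(16).Wait();
    }
}
```

Nothing is printed while the downloads are in flight, and then the lines appear together with timestamps close to the one-second mark. The counter still goes up by one each time; the steps are just too close together to see individually.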

SLaks
  • In that case, can I give priority to the main thread somehow? That is, the thread that queues the async requests into the ThreadPool? I need this because the main thread also makes a request of its own after every 20 async requests. What's happening is that this request gets backlogged behind all the already queued ThreadPool requests, and the whole program blocks waiting for its response. – blizz May 17 '12 at 03:42
  • @user1115071: Consider using the TPL, which is already optimized for this. – SLaks May 17 '12 at 11:45
  • Please forgive my ignorance as I've never used the TPL. Should I be using it for all threads, or only for the main ones I mentioned? – blizz May 18 '12 at 22:20
  • Use `Parallel.For*` or `Task` or LINQ `AsParallel()` and don't use threads or the threadpool directly at all. – SLaks May 20 '12 at 02:06
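
As a rough illustration of the `Task` option from the last comment, the sketch below starts one task per URL and parses each page as soon as its download completes, without touching threads or the ThreadPool directly. It is not the asker's code: `Crawler`, `CrawlAsync`, the injected `download` delegate, and the regex pattern are all hypothetical names chosen for the example. In a real crawler the delegate could wrap something like `HttpClient.GetStringAsync`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

static class Crawler
{
    // Hypothetical pattern; the real one depends on the target site's HTML.
    static readonly Regex ItemPattern =
        new Regex(@"<span class=""item"">(.*?)</span>");

    // Starts one task per URL and parses each page as soon as it arrives.
    // 'download' is injected so a real HTTP client or a fake can be plugged in.
    public static async Task<List<string>> CrawlAsync(
        IEnumerable<string> urls, Func<string, Task<string>> download)
    {
        var items = new ConcurrentBag<string>(); // safe to add to from several tasks
        int parsed = 0;

        var tasks = urls.Select(async url =>
        {
            string html = await download(url);
            foreach (Match m in ItemPattern.Matches(html))
            {
                items.Add(m.Groups[1].Value);
                Console.WriteLine("parsed {0}", Interlocked.Increment(ref parsed));
            }
        }).ToArray();

        await Task.WhenAll(tasks);
        return items.ToList();
    }
}

static class Program
{
    static void Main()
    {
        // Fake download so the sketch runs without a network connection.
        Func<string, Task<string>> fake = url =>
            Task.FromResult("<span class=\"item\">" + url + "</span>");

        List<string> found = Crawler.CrawlAsync(new[] { "a", "b", "c" }, fake).Result;
        Console.WriteLine("{0} items found", found.Count);
    }
}
```

Because the extra request made after every 20 downloads would be just another task here, it is awaited like the rest instead of blocking a thread while it waits for its response.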