How to crawl XML(s) very fast — considering the below networking limitations?

Question

I have a .NET crawler that runs when the user makes a request (so it needs to be fast). It crawls 400+ links in real time. (This is the business ask.)

The problem: I need to detect whether a link is xml (think of rss or atom feeds) or html. If the link is xml I continue with processing, but if it is html I can skip it. Usually, I have 2 xml(s) and 398+ html(s). Currently, I have multiple threads going but the processing is still slow: usually 75 seconds with 10 threads for 400+ links, or 280 seconds with 1 thread. (I want to add more threads, but see below.)

The challenge that I am facing is that I read the streams as follows:

var request = WebRequest.Create(requestUriString: uri.AbsoluteUri);
// ....
var response = await request.GetResponseAsync();
//....
using (var reader = new StreamReader(stream: response.GetResponseStream(), encoding: encoding)) {
    char[] buffer = new char[1024];
    // ReadAsync may return fewer than 1024 chars; keep only what was actually read.
    int read = await reader.ReadAsync(buffer: buffer, index: 0, count: buffer.Length);
    responseText = new string(value: buffer, startIndex: 0, length: read);
}
// parse the first bytes of responseText to check if it is xml

The problem is that my optimization of reading only 1024 characters is quite useless, because as far as I can see GetResponseAsync downloads the entire stream anyway. (The other option I have is to check the Content-Type header, but AFAIK that is much the same because I get the content anyway, unless you recommend using OPTIONS, which I have not used so far; and in addition xml might be content-type incorrectly marked (?) and I am going to miss some content.)

If there is any optimization that I am missing please help, as I am running out of ideas.

(I am considering optimizing this design by spreading the load across multiple servers, so that I balance the network load with the parallelism, but that is a significant change from the current architecture that I cannot afford to make at this point in time.)

Maybe have a look at https://stackoverflow.com/questions/21017328/how-to-retrieve-partial-response-with-system-net-httpclient - the top answer suggests using `TcpClient` instead. — Ian Kemp, Jan 22 '19 at 05:47
It seems that there are no real answers to this question. This post from Rick Strahl goes pretty in depth on this exact problem: https://weblog.west-wind.com/posts/2014/Jan/29/Using-NET-HttpClient-to-capture-partial-Responses — Alberto Chiesa, Jan 22 '19 at 11:50
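
In the spirit of the links above, one way to read only the start of a response with HttpClient is to pass HttpCompletionOption.ResponseHeadersRead, which returns once the headers arrive instead of buffering the whole body. The following is a minimal sketch, not a definitive implementation: the class name FeedSniffer, the method names, and the list of root elements checked are assumptions made for illustration, and the check is only a heuristic (an XHTML page that begins with an XML declaration would be a false positive).

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class FeedSniffer
{
    // Heuristic (assumption): after an optional BOM and whitespace, an XML
    // declaration or a common feed root element is treated as XML.
    public static bool LooksLikeXml(string prefix)
    {
        var text = (prefix ?? string.Empty).TrimStart('\uFEFF', ' ', '\t', '\r', '\n');
        return text.StartsWith("<?xml", StringComparison.OrdinalIgnoreCase)
            || text.StartsWith("<rss", StringComparison.OrdinalIgnoreCase)
            || text.StartsWith("<feed", StringComparison.OrdinalIgnoreCase)
            || text.StartsWith("<rdf:RDF", StringComparison.OrdinalIgnoreCase);
    }

    public static async Task<bool> IsFeedAsync(HttpClient client, Uri uri, int maxChars = 1024)
    {
        // ResponseHeadersRead: complete as soon as headers are read, body not buffered.
        using (var response = await client.GetAsync(uri, HttpCompletionOption.ResponseHeadersRead))
        {
            if (!response.IsSuccessStatusCode) return false;
            using (var stream = await response.Content.ReadAsStreamAsync())
            using (var reader = new StreamReader(stream))
            {
                var buffer = new char[maxChars];
                // ReadBlockAsync keeps reading until the buffer is full or the stream ends.
                int read = await reader.ReadBlockAsync(buffer, 0, buffer.Length);
                return LooksLikeXml(new string(buffer, 0, read));
            }
        }
    }
}
```

Disposing the response after the first block is what stops the rest of the body from being consumed; how much the server has already sent by then depends on the network and the server.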

Tim Andrews · Answer 1 · 2019-01-22T06:30:24.920

Using HEAD requests could speed up the requests significantly, IF you can rely on the Content-Type.

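e.g.

// Reuse one HttpClient for all requests rather than creating one per link.
HttpClient client = new HttpClient();
// HEAD returns only the status line and headers, no body.
HttpResponseMessage response = await client.SendAsync(new HttpRequestMessage(HttpMethod.Head, uri));

This only shows basic usage. `uri` is the link being checked, and you still need to add anything else the request requires.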
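
Building on the HEAD request above, a minimal sketch of deciding from the Content-Type header might look like this. The class HeadProbe, its method names, and the set of media types accepted are assumptions for illustration; the result is only as trustworthy as the header the server sends.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HeadProbe
{
    // Assumed rule: application/xml, text/xml and any "+xml" type
    // (application/rss+xml, application/atom+xml) count as XML, except
    // application/xhtml+xml, which is a web page rather than a feed.
    public static bool IsXmlMediaType(string mediaType)
    {
        if (string.IsNullOrWhiteSpace(mediaType)) return false;
        var type = mediaType.Trim().ToLowerInvariant();
        if (type == "application/xhtml+xml") return false;
        return type == "application/xml" || type == "text/xml" || type.EndsWith("+xml");
    }

    public static async Task<bool> LooksLikeXmlAsync(HttpClient client, Uri uri)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Head, uri))
        using (var response = await client.SendAsync(request))
        {
            // ContentType is null when the header is missing or cannot be parsed.
            var mediaType = response.Content?.Headers.ContentType?.MediaType;
            return response.IsSuccessStatusCode && IsXmlMediaType(mediaType);
        }
    }
}
```

MediaType excludes parameters such as charset, so "text/xml; charset=utf-8" is compared as "text/xml".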

Also note that even with 10 threads, 400 requests will likely always take quite a while: 400/10 means each thread makes 40 requests sequentially. Unless the servers are close by, 200 ms would be a good response time, which gives a minimum of 40 × 200 ms = 8 seconds. Slow overseas servers could easily push this out to 30-40 seconds of unavoidable delay, unless you increase the number of threads to run more of the requests in parallel.

Dataflow (Task Parallel Library) can be very helpful for writing parallel pipelines, with a convenient MaxDegreeOfParallelism property for easily adjusting how many parallel instances can run.
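
As a rough sketch of the Dataflow idea, an ActionBlock can run a per-link check with a bounded number of calls in flight. The class ParallelCrawl, the method FindFeedsAsync, and the `probe` delegate are hypothetical names; `probe` stands in for whatever check is used (a HEAD request or a partial read). On .NET Framework this needs the System.Threading.Tasks.Dataflow NuGet package.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ParallelCrawl
{
    // Runs `probe` over every link with at most `maxParallel` calls in flight
    // and returns the links for which it reported true (in no particular order).
    public static async Task<List<Uri>> FindFeedsAsync(
        IEnumerable<Uri> links, Func<Uri, Task<bool>> probe, int maxParallel)
    {
        var feeds = new ConcurrentBag<Uri>();
        var block = new ActionBlock<Uri>(
            async uri =>
            {
                if (await probe(uri)) feeds.Add(uri);
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        foreach (var link in links) block.Post(link);
        block.Complete();        // signal that no more links will be posted
        await block.Completion;  // wait for the remaining probes to finish
        return new List<Uri>(feeds);
    }
}
```

Because the probes are awaited rather than blocking threads, maxParallel can be raised well beyond 10 without a matching number of threads. An exception thrown by `probe` faults the block and surfaces from `block.Completion`, so a real probe would likely catch request failures and return false.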

As the asker stated, "in addition xml might be content-type incorrectly marked (?) and I am going to miss some content" - so s/he doesn't want to rely on that header. — Ian Kemp, Jan 22 '19 at 08:58
Which is why I emphasized it would help IF the Content-Type is reliable. Processing the results of the requests will only take seconds, if that. So the OP needs to either reduce the bandwidth (e.g. by using HEAD), and/or increase parallelism, which I also suggested in my answer. Making 100 sequential requests to a server on the other side of the world at literally the speed of light will still take half a minute. — Tim Andrews, Jan 23 '19 at 03:28
