I've come across a website that seems to resist being read via an HTTP GET from C#, but works fine in Firefox, Postman and PowerShell, and I would really love to understand why.

Here is the repro case in PowerShell:

Invoke-WebRequest -UseBasicParsing -Uri https://www.tripadvisor.de/

This call works: it returns a 200 and PowerShell dumps the result.

When I do the same in C#, the request hangs until it times out; I never get a response.

Here is the C# code. It is not that complicated:

var client = new HttpClient();
var r = await client.GetAsync("https://www.tripadvisor.de/");
r.EnsureSuccessStatusCode();

In a .NET 6.0 application with top-level statements this is actually the entire program code. I've also tested this in a .NET Framework 4.8 console app (with an async Task Main, of course): same behaviour.

Tested on Windows 11 and Windows 10.

Can somebody reproduce this behaviour?

P.S.: I am sure this is not an async/await or deadlock issue, because plenty of other sites work correctly.

**Edit:** My working hypothesis is that the server simply ignores these requests based on some metric, because it assumes they come from an invalid client (non-browser, scraper, flooder).

What I have also tried

I've been trying to make this C# code work by passing an HttpClientHandler restricted to TLS 1.2:

using var handler = new HttpClientHandler() { SslProtocols = SslProtocols.Tls12 };
var client = new HttpClient(handler);

Same result.

I've been sending ALL the additional headers that Firefox, for example, sends (User-Agent, Accept, Accept-Encoding, etc.); no luck.
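For reference, a minimal sketch of how such browser-like headers can be attached to a single request. The header values below are illustrative assumptions, not the exact ones a real Firefox sends, and `DecompressionMethods.Brotli` assumes .NET Core 2.1 or later (it is not available on .NET Framework 4.8). Accept-Encoding is not set by hand here, on the assumption that the handler advertises the encodings it can decompress. The actual send is guarded by a hypothetical `--send` argument so the program can be run without touching the network.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Let the handler decompress the response body automatically.
var handler = new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.GZip
                           | DecompressionMethods.Deflate
                           | DecompressionMethods.Brotli
};

// A short timeout so a hanging request fails quickly instead of after the default 100 s.
using var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(15) };

var request = new HttpRequestMessage(HttpMethod.Get, "https://www.tripadvisor.de/");

// Illustrative header values (assumptions); TryAddWithoutValidation stores them as given.
request.Headers.TryAddWithoutValidation("User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0");
request.Headers.TryAddWithoutValidation("Accept",
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
request.Headers.TryAddWithoutValidation("Accept-Language", "de,en-US;q=0.7,en;q=0.3");

// Dump method, URI, version and headers for comparison with a browser capture.
Console.WriteLine(request);

// Only hit the network when explicitly asked to (hypothetical flag).
if (args.Length > 0 && args[0] == "--send")
{
    using var response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
    Console.WriteLine((int)response.StatusCode);
}
```

Printing the request before sending makes it easier to diff against what the browser's Network tab shows.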

I've added a certificate callback to the handler just in case, but sslPolicyErrors had no errors anyway:

handler.ServerCertificateCustomValidationCallback = (message, cert, chain, sslPolicyErrors) => true;

The parallel stack shows that the code is trying to receive a full TLS frame (I am just guessing):

[Image: tls-stack]

Can somebody test the basic C# code and validate the behaviour? Is this an issue with HttpClient? Am I missing something? Is this server configured in a weird way?

Update: The old and deprecated WebRequest approach worked for a short time:

var res = WebRequest.Create("https://www.tripadvisor.de/");
res.Method = "GET";
var resp = res.GetResponse();

var content = await new StreamReader(resp.GetResponseStream()).ReadToEndAsync();

Console.WriteLine(content);

While it resulted in several 200 responses, it appears that the server later blacklisted me, probably because I was lacking indicators of a "real" user. I understand that this is some scraper / flood control mechanism. What I am missing is why Invoke-WebRequest did not get me blocked.

Note: There is no plain HTTP version of this page to test against :/

I've also tried ConfigureAwait(false), but as far as I understand this cannot be the issue, because other websites work.

Samuel
  • Have you looked at https://stackoverflow.com/questions/10343632/httpclient-getasync-never-returns-when-using-await-async ? And have you tried to do a `ConfigureAwait(false)`? – Lucero Aug 25 '23 at 08:10
  • You just proved that HttpClient just works. `WebClient` uses HttpClient underneath in .NET Core. Even HttpWebRequest is nothing more than a compatibility wrapper over HttpClient – Panagiotis Kanavos Aug 25 '23 at 08:12
  • @Lucero I tried, this does not appear to be the solution. I'll update the post – Samuel Aug 25 '23 at 08:13
  • @Samuel you already proved that HttpClient works and `ConfigureAwait` has nothing to do with this. Your phrase `seems to resist being read via HTTP/Get` is relevant - high traffic sites actually take measures against screen scrapers. Your plain GET request doesn't even send a User-Agent header so it's trivial to label it as a bot – Panagiotis Kanavos Aug 25 '23 at 08:15
  • @PanagiotisKanavos I am aware of that. I've already reproduced sending all relevant headers like Firefox does, still no answer. Many servers behave differently: some reject when the User-Agent is missing, some reject when the User-Agent is there but the Accept header is not set, etc. If this were the case, why do `Invoke-WebRequest` and `WebRequest` work? – Samuel Aug 25 '23 at 08:17
  • @PanagiotisKanavos do you have a source for WebRequest using HttpClient? I've been decompiling the assemblies and am not yet sure about this. – Samuel Aug 25 '23 at 08:19
  • .NET Core is open source, you don't have to decompile anything. Just google for `github dotnet HttpWebRequest.cs`. The direct link is [this one](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Requests/src/System/Net/HttpWebRequest.cs). HttpClient is a static volatile field in [this line](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Requests/src/System/Net/HttpWebRequest.cs#L79) – Panagiotis Kanavos Aug 25 '23 at 08:23
  • Have you actually checked what's going on in the Network tab when you request `tripadvisor.de`? I see a redirect to www.tripadvisor.de, 20 secs of small responses and a *never ending* GET to `tripadvisor.de`. I don't know whether that's e.g. a long request to allow server push (pointless in Chrome) or a trap for crawlers – Panagiotis Kanavos Aug 25 '23 at 08:35
  • @PanagiotisKanavos I did, I also decoded the TLS traffic in Wireshark without a clue :D. The network traffic in the browser seems related to further async requests to other assets; it might also create request traps, but I doubt that, because `Invoke-WebRequest` (with headers) is hardly going to handle those correctly, yet seems to work reliably. – Samuel Aug 25 '23 at 09:10
  • You already saw that WebRequest uses HttpClient, so TLS or Wireshark won't help. The settings used by WebRequest will be different from HttpClient's defaults, to ensure compatibility with older code. In HttpWebRequest source I linked above you can see that WebRequest's properties are [used to configure an HttpClientHandler](https://github.com/dotnet/runtime/blob/699ffbe66e273274b0eee213b1ed4564ec4a7840/src/libraries/System.Net.Requests/src/System/Net/HttpWebRequest.cs#L1593) – Panagiotis Kanavos Aug 25 '23 at 09:44
  • If you have a working request (browser) and a failing request (app), run them both through Fiddler and compare. If the requests are truly identical, then there must be a stateful firewall of sorts, and you'll need to send all the other requests the working client does; the other requests are acting as a "key". It's also possible you may have to simulate delays. In the extreme case, the key may be dynamic and you'd need to embed a browser or at least a JS interpreter to determine the key on each request. – Stephen Cleary Aug 25 '23 at 10:20
  • try https://stackoverflow.com/questions/16828688/how-can-i-emulate-a-web-browser-http-request-from-code – Seabizkit Aug 25 '23 at 12:23
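
As a rough sketch of the Fiddler comparison suggested in the comments above: an HttpClient can be routed through a local debugging proxy so its request can be captured next to the browser's. The address `http://127.0.0.1:8888` is an assumption (Fiddler Classic's default listener), and decrypting HTTPS additionally assumes the proxy's root certificate is trusted on the machine. The send is guarded by a hypothetical `--send` argument.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Assumed address of a local debugging proxy (Fiddler Classic's default port).
var proxyAddress = new Uri("http://127.0.0.1:8888");

var handler = new HttpClientHandler
{
    Proxy = new WebProxy(proxyAddress),
    UseProxy = true
};

using var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(15) };
Console.WriteLine($"Requests will be routed through {proxyAddress}");

// Only hit the network when explicitly asked to (hypothetical flag).
if (args.Length > 0 && args[0] == "--send")
{
    using var response = await client.GetAsync("https://www.tripadvisor.de/");
    Console.WriteLine((int)response.StatusCode);
}
```

With both the browser and this client going through the same proxy, the two captured requests can be compared header by header.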

0 Answers