
I'm using perl-HTTP-Tiny-0.080 on Fedora 35 to check the status of URLs and determine their return codes. My script runs fine until it reaches one particular URL, a PDF at sophos.com. There the script stalls: the get() or head() call on the object returned by new() never returns. I've also tried setting a timeout, and it appears to be ignored.

use HTTP::Tiny;
use Net::FTP::Tiny qw(ftp_get);
my $url = "https://news.sophos.com/wp-content/uploads/2020/02/CloudSnooper_report.pdf";
my $response = HTTP::Tiny->new(timeout => 2)->get($url);
print "status: $response->{status} $url\n";

The print is never reached. Fetching the URL manually with wget succeeds, but setting the agent to something other than the default "HTTP-Tiny/0.080" does not help:

my $response = HTTP::Tiny->new(agent => "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")->get($url);

This code is part of a larger script that I'm using to check a series of URLs from a buffer to determine whether they're 404s and should be removed, or are still working links.

I'm unsure what further info I can provide.

Alex Regan

1 Answer


The URL you have for news.sophos.com redirects to some other URL at www.sophos.com. The latter server is protected by Akamai CDN:

$ dig www.sophos.com
...
www.sophos.com.         169     IN      CNAME   www.sophos.com.edgekey.net.
www.sophos.com.edgekey.net. 469 IN      CNAME   e6203.b.akamaiedge.net.
e6203.b.akamaiedge.net. 300     IN      A       23.60.192.131

Akamai's bot protection can behave strangely when a request does not look like one sent by a typical browser. The request might fail with status code 403, or it might simply hang as you are experiencing, i.e. the server tarpits the client. See also "Requests SSL connection timeout" and "Strange CURL issue with a particular website SSL certificate", as well as "Why does Akamai edge services sometime just not send any response, leaving the connection to timeout", which describes a problem similar to the one you have with www.sophos.com.

In this specific case simply adding an Accept header to the request worked for me:

my $response = HTTP::Tiny->new(default_headers => { Accept => '*/*' })->get($url);

Note that this workaround might no longer work in the future when Akamai adjusts its bot detection.
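For the larger task of sorting a batch of URLs into dead and working links, the workaround could be folded into a loop along these lines. This is only a minimal sketch: `classify_response` and its category names are hypothetical, and using head() instead of get() is my assumption to avoid downloading whole files (switch to get() if a server treats HEAD differently). It relies on HTTP::Tiny reporting its own internal failures with the pseudo-status 599.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Map an HTTP::Tiny response hash to a coarse verdict.
# The category names are hypothetical, chosen for this sketch only.
sub classify_response {
    my ($res) = @_;
    return 'ok'    if $res->{success};          # any 2xx status
    return 'dead'  if $res->{status} == 404;    # candidate for removal
    return 'error' if $res->{status} == 599;    # HTTP::Tiny internal error (timeout, TLS, DNS, ...)
    return 'unknown';                           # e.g. 403 from bot protection, or 5xx
}

# One client object for all requests, with the Accept header workaround.
my $http = HTTP::Tiny->new(
    timeout         => 10,
    default_headers => { Accept => '*/*' },
);

# URLs are taken from the command line; nothing is requested without arguments.
for my $url (@ARGV) {
    my $res = $http->head($url);
    printf "%-7s %s %s\n", classify_response($res), $res->{status}, $url;
}
```

Only links classified as 'dead' would be safe to remove automatically; a 403 or 599 says more about the bot protection or the connection than about the link itself.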

"I've also tried to set a timeout and it appears to be ignored."

This is a known issue, which is especially noticeable when TLS 1.3 is used, as is the case here. See the issue "Sometimes, timeout can fail to fire" (#146).
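Until that is fixed, one possible safeguard is to put a wall-clock limit around the whole request with the common alarm/eval idiom. The sketch below is not part of HTTP::Tiny: `with_deadline` is a hypothetical helper, and I am assuming that SIGALRM interrupts the blocking read on your platform, which is usually but not necessarily the case. Running the request in a child process that can be killed would be more robust.

```perl
use strict;
use warnings;

# Run $code with a wall-clock limit of $seconds (hypothetical helper).
# Returns ($result, undef) on success, (undef, $error) on failure or timeout.
sub with_deadline {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "deadline exceeded\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    my $err = $ok ? undef : ($@ || "unknown error\n");
    alarm 0;    # make sure no alarm stays pending after a die
    return ($result, $err);
}

# Usage: URLs come from the command line, so nothing runs without arguments.
if (@ARGV) {
    require HTTP::Tiny;
    my $http = HTTP::Tiny->new(timeout => 5, default_headers => { Accept => '*/*' });
    for my $url (@ARGV) {
        my ($res, $err) = with_deadline(10, sub { $http->get($url) });
        if (defined $err) {
            chomp $err;
            print "failed: $err $url\n";
        }
        else {
            print "status: $res->{status} $url\n";
        }
    }
}
```

The deadline is deliberately longer than the HTTP::Tiny timeout, so the library's own timeout gets the first chance to fire and the alarm only acts as a backstop.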

Steffen Ullrich
    Okay, the "Accept" header worked, but where did you learn of this? What does it do? How does it change the behavior whereas changing the user agent isn't sufficient? – Alex Regan Feb 02 '22 at 02:47
    @AlexRegan: *"where did you learn of this"* - experience and research :) You are not the first one to tackle this issue, and this is not the first time I'm dealing with this kind of problem. *"How does it change the behavior whereas changing the user agent isn't sufficient?"* - the user agent is too obvious a choice. The times when bots could just change it to bypass anti-bot protections are long over. Today's anti-bot protection is harder to bypass. The workaround I've shown might stop working when too many people use it; then Akamai will update the protection. – Steffen Ullrich Feb 02 '22 at 05:35