I'm a Delphi developer and was tasked with crawling the <title>, meta description, and meta keywords of public-facing websites.

It worked fine until I encountered a website that redirects itself without using HTTP 301/302.

For example, if I enter example.com in the address bar, the browser automatically jumps to example-b.com, but the redirect happens on the client side, not via HTTP 301 or 302.

My goal is to get the title, description, and keywords of example-b.com.

I'm using TIdHTTP in Delphi, if that helps.

Edit
I tried this answer, but it states that it only works with HTTP 301 and 302. I have handled those redirects already. I'm trying to figure out how to handle <meta> refresh tags or other HTML commands that perform redirects.

Edit 2
I just found these commands:

<meta http-equiv="refresh" content="5;url=http://thisinterestsme.com/detecting-ajax-requests-with-php/">
header( "refresh:5;url=http://thisinterestsme.com/php-forcing-https-over-http/" );
header('Location: http://thisinterestsme.com/php-forcing-https-over-http/');
window.location.href= 'http://thisinterestsme.com/php-forcing-https-over-http/';

Let me know if I missed any other commands.

cam8001
Ago

1 Answer

TIdHTTP does not follow meta refresh redirects, even if HandleRedirects is set to True. It does, however, parse <meta http-equiv=... tags, provided hoNoParseMetaHTTPEquiv is not included in the HTTPOptions property of TIdHTTP. By default, that option is not included. After performing a request, you can access the parsed values via IdHTTP.MetaHTTPEquiv, which is shorthand for IdHTTP.Response.MetaHTTPEquiv.

Since Indy doesn't follow it, you have to do it yourself, with all the burden that entails: parsing the URL out of the value, performing the redirection, and detecting cyclic or infinite redirections. The same goes for the Refresh header, which is not part of the official HTTP standards.
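
As a rough illustration of that work, here is a minimal sketch of parsing a refresh value such as "5;url=..." and guarding against loops. It is written in Python rather than Delphi because the logic is plain string handling that ports directly; the names parse_refresh, follow_refreshes, and the get_refresh callback are hypothetical and not part of Indy. In Delphi, the raw value would come from the MetaHTTPEquiv data described above, and the fetch would be a TIdHTTP request.

```python
import re
from urllib.parse import urljoin

# Matches "<delay>", optionally followed by ";" or "," and a target URL.
_REFRESH_RE = re.compile(
    r"""^\s*(\d+(?:\.\d*)?)\s*        # delay in seconds
        (?:[;,]\s*(?:url\s*=\s*)?     # separator and optional "url="
        (['"]?)(.*?)\2)?\s*$          # target, optionally quoted
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_refresh(value):
    """Return (delay_seconds, target_or_None), or None if unparseable."""
    m = _REFRESH_RE.match(value)
    if m is None:
        return None
    return float(m.group(1)), (m.group(3) or None)


def follow_refreshes(start_url, get_refresh, max_hops=5):
    """Follow refresh-style redirects and return the final URL.

    get_refresh(url) is a caller-supplied (hypothetical) callback that
    fetches url and returns its raw refresh value, or None if it has none.
    """
    seen = {start_url}
    url = start_url
    for _ in range(max_hops):
        value = get_refresh(url)
        parsed = parse_refresh(value) if value else None
        if parsed is None or parsed[1] is None:
            return url  # no redirect target: this is the final page
        next_url = urljoin(url, parsed[1])  # resolve relative targets
        if next_url == url:
            return url  # page only reloads itself
        if next_url in seen:
            raise RuntimeError("redirect loop at " + next_url)
        seen.add(next_url)
        url = next_url
    raise RuntimeError("too many redirects")
```

The regular expression is deliberately lenient (it accepts "," as a separator and a missing "url=" prefix), which is an assumption about what real pages emit rather than a statement of any specification. The hop limit and the set of visited URLs together cover both long chains and cycles.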

The Location header is only valid with HTTP status codes 201 Created, 202 Accepted and 3xx. It should only redirect when status code is 3xx (except 304 Not Modified), which Indy already does, so you don't need to handle this in any special way.

Finally, supporting JavaScript redirections raises the task to a much higher level of complexity, one that TIdHTTP cannot handle on its own. That is a use case for a headless browser.

Peter Wolf