0

I'm crawling a list of links on a web page using JavaScript and jQuery. Each link points to either a web page or a large binary file (PPT, etc.).

How do I detect whether the content is a web page ('text/html') or not? I'm fairly sure it involves inspecting the HTTP response headers with $.ajax, and I know similar questions have been posted, but I can't find an example that fits this particular case.

Cœur
El-Jus

3 Answers

3

You can check the extension of the URL, which is the lightest method. Alternatively, you can send an Ajax HEAD request and read the Content-Type response header:

var url = 'someurl';
var xhttp = new XMLHttpRequest();
// HEAD asks for the response headers only, without the body
xhttp.open('HEAD', url);
xhttp.onreadystatechange = function () {
  if (this.readyState === this.DONE) {
    console.log(this.status);
    // e.g. "text/html; charset=utf-8" for a web page
    console.log(this.getResponseHeader("Content-Type"));
  }
};
xhttp.send();
Mateusz Kudej
2

You can't reliably infer the type from the URL: it may contain an extension such as exe or html, but it doesn't have to, and even when it does, the extension is no guarantee of the actual content.

The closest you can get without completely downloading and examining the file is probably to send an HTTP HEAD request to the URL. The response should contain only the headers, without the body, and those headers should include Content-Type. This all depends on the implementation and configuration of the backend, though, so there is no guarantee that the request will be answered correctly, or even answered at all.
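
The approach above could be sketched roughly as follows. This is only a sketch: the helper names isHtmlContentType and looksLikeWebPage are made up for illustration, it uses fetch rather than $.ajax or XMLHttpRequest, and it assumes the server answers HEAD requests and sends a meaningful Content-Type header.

```javascript
// Returns true when a Content-Type header value denotes an HTML page.
// Parameters such as "; charset=utf-8" are ignored.
function isHtmlContentType(headerValue) {
  if (!headerValue) return false; // header missing: cannot classify as HTML
  const mediaType = headerValue.split(';')[0].trim().toLowerCase();
  return mediaType === 'text/html' || mediaType === 'application/xhtml+xml';
}

// Hypothetical helper: sends a HEAD request and classifies the response.
// Resolves to true or false, or to null when the request fails or the
// server does not return a successful status.
async function looksLikeWebPage(url) {
  try {
    const response = await fetch(url, { method: 'HEAD' });
    if (!response.ok) return null;
    return isHtmlContentType(response.headers.get('Content-Type'));
  } catch (err) {
    return null; // network error or blocked request
  }
}
```

A null result means "unknown" rather than "not a web page", so a crawler would probably want to treat it separately from false.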

TimoStaudinger
1

If you have the file names, you can use filename.split('.').pop(), which returns the part of the name after the last dot, i.e. the file extension.
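
Note that filename.split('.').pop() returns the whole string when there is no dot, and a query string ends up in the result if you pass it a full URL. A slightly more defensive variant might look like this; getExtension is a hypothetical helper name, and the sketch assumes absolute URLs, since the URL constructor throws on relative ones when no base is given.

```javascript
// Hypothetical helper: returns the lower-cased extension of a URL's path,
// or an empty string when the last path segment has no extension.
function getExtension(url) {
  // URL parsing separates the query string and fragment from the path.
  const pathname = new URL(url).pathname;
  const lastSegment = pathname.split('/').pop();
  const dotIndex = lastSegment.lastIndexOf('.');
  // No dot, or only a leading dot (e.g. ".htaccess"): treat as no extension.
  if (dotIndex <= 0) return '';
  return lastSegment.slice(dotIndex + 1).toLowerCase();
}
```

As the other answers point out, the extension is only a hint about the content, so this is best used as a cheap first check.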

Ryan Knutson