I am trying to check whether a PDF file exists on arXiv. Here are two example URLs:

arxiv.org/pdf/1207.4102.pdf

arxiv.org/pdf/1207.41021.pdf

The first is a PDF file; the second is not, and returns an error page.

Is there a way to check whether a URL points to a PDF or not? I tried the answers in How do I check if file exists in jQuery or JavaScript?, but none of them work: they return true (i.e. file exists) for both URLs. Is there a way to find out which URL is a PDF file in JavaScript/jQuery, or even PHP?

Can this be solved using pdf.js?

user3741635
  • It looks like the http://arxiv.org/ .htaccess rewrites all requests and has no error page set, so all requests receive a 200 response. Try http://arxiv.org/pdf/1207.41102.pdf in your browser. You could then parse the response to see whether it is HTML; if it is not, it could be your PDF. – Julio Soares Sep 30 '15 at 06:20
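
A minimal sketch of the "parse the response" idea from the comment above. The helper names `looksLikePdf` and `urlIsPdf` are hypothetical. The sketch assumes a runtime with a global `fetch` (for example Node 18+); in a browser page, cross-origin restrictions may block the request. It relies on the fact that PDF files start with the ASCII signature `%PDF-`.

```javascript
// Hypothetical helper: decide whether a response looks like a PDF.
// contentType: value of the Content-Type header (may be null).
// firstBytes:  Uint8Array holding the first bytes of the response body.
function looksLikePdf(contentType, firstBytes) {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const type = (contentType || "").split(";")[0].trim().toLowerCase();
  if (type === "text/html") return false; // an HTML error page, not a PDF

  // PDF files begin with the ASCII signature "%PDF-".
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d];
  if (firstBytes.length < signature.length) return false;
  return signature.every((byte, i) => firstBytes[i] === byte);
}

// Assumed usage: downloads the whole body, which is simple but not cheap.
async function urlIsPdf(url) {
  const response = await fetch(url);
  const bytes = new Uint8Array(await response.arrayBuffer());
  return response.ok && looksLikePdf(response.headers.get("content-type"), bytes);
}
```

Checking the body signature as well as the header means a server that answers 200 with an HTML page is still rejected.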

4 Answers


You can try this code to check, by URL, whether a file exists on a remote server:

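    // get_headers() needs a full URL, including the scheme
    $filename = 'http://arxiv.org/pdf/1207.4102.pdf';
    $file_headers = @get_headers($filename);

    // Match on the status code rather than the exact status line,
    // so that HTTP/1.1 responses are handled as well
    if ($file_headers === false || strpos($file_headers[0], ' 404') !== false) {
        echo "The file $filename does not exist";
    } else if (strpos($file_headers[0], ' 302') !== false && isset($file_headers[7]) && strpos($file_headers[7], ' 404') !== false) {
        echo "The file $filename does not exist, and I got redirected to a custom 404 page.";
    } else {
        echo "The file $filename exists";
    }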
Ajeet Kumar

You may want to use cURL and check for a 200 HTTP status code, i.e.:

<?php

$url = 'http://arxiv.org/pdf/1207.41021.pdf';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);    // we want headers
curl_setopt($ch, CURLOPT_NOBODY, true);    // we don't need body
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1); // we follow redirections
curl_setopt($ch, CURLOPT_TIMEOUT,10);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


// curl_getinfo() returns the status code as an integer
if ($httpcode === 200) {
    echo "file exists";
} else {
    echo "file doesn't exist";
}

Both PDF URLs return 403 Forbidden:

The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
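
As the quoted text suggests, a 403 says that the server refused the request, not whether the file is there. A rough sketch of how one might separate those cases, using a hypothetical helper `interpretStatus` (the grouping of codes is an assumption, not something arXiv documents):

```javascript
// Hypothetical helper: what does an HTTP status code tell us about existence?
function interpretStatus(code) {
  if (code >= 200 && code < 300) return "exists";   // success (may still be an HTML page)
  if (code === 404 || code === 410) return "missing"; // not found / gone
  if (code === 401 || code === 403 || code === 429) return "blocked"; // refused, existence unknown
  return "unknown";                                   // redirects, server errors, etc.
}
```

With this grouping, a "blocked" result means the request itself should be changed (for example its headers) before drawing any conclusion about the file.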

Pedro Lobito

This returns the correct result. Note that it sets a browser-like User-Agent header.

function getHTTPCode($url) {

    $ch = curl_init($url);
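    curl_setopt($ch, CURLOPT_HEADER, true);        // include headers in the output
    curl_setopt($ch, CURLOPT_NOBODY, true);        // HEAD request: skip the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the output instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);         // give up after 10 seconds
    // send a browser-like User-Agent
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)');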
    $output = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpcode;

}

$url = 'http://arxiv.org/pdf/1207.41021.pdf';
if(getHTTPCode($url)==200) {
 echo  'found';
} else {
 echo  'not found';
}
Samir Das

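Using PHP, you can check whether a local file exists with file_exists(): http://php.net/manual/en/function.file-exists.php. Note that it does not work for http:// URLs, because the HTTP wrapper does not support stat().

For a remote file, check the headers of a request instead: https://stackoverflow.com/a/8139136/3222087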

Slowmove