How to check whether pdf file exists?

Question

I am trying to check whether a PDF file exists on arXiv. Here are two examples:

arxiv.org/pdf/1207.4102.pdf

arxiv.org/pdf/1207.41021.pdf

The first is a PDF file; the second is not, and returns an error page.

Is there a way to check whether a URL points to a PDF or not? I tried the answers to "How do I check if file exists in jQuery or JavaScript?", but none of them work: they return true (i.e. the file exists) for both URLs. Is there a way to find out which URL is a PDF file in JavaScript/jQuery or even PHP?

Can this be solved using pdf.js?

It looks like the http://arxiv.org/ .htaccess rewrites all requests and has no error page set, so all requests receive a 200 response. Try http://arxiv.org/pdf/1207.41102.pdf in your browser. You could then parse the response to see whether it is HTML; if it is not, it could be your PDF. – Julio Soares, Sep 30 '15 at 06:20
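The idea in the comment above, inspecting the response body rather than trusting the status code, could be sketched in JavaScript roughly as follows. This is only a sketch: the helper name `looksLikePdf` is made up for illustration, and it assumes you already have the first few bytes of the response as a `Uint8Array`. It relies on the fact that PDF files begin with the ASCII signature `%PDF-`, whereas an HTML error page does not.

```javascript
// Hypothetical helper (not part of any library): decide from the first
// bytes of a response body whether it is a PDF.
// PDF files begin with the ASCII signature "%PDF-".
function looksLikePdf(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (!bytes || bytes.length < signature.length) {
    return false;
  }
  // Compare the leading bytes against the signature one by one.
  return signature.every((byte, i) => bytes[i] === byte);
}

// Sample bodies built in place of real downloads.
const pdfBody = new TextEncoder().encode("%PDF-1.5\n...");
const htmlBody = new TextEncoder().encode("<!DOCTYPE html><html>...");

console.log(looksLikePdf(pdfBody));  // true
console.log(looksLikePdf(htmlBody)); // false
```

In practice the bytes would come from an actual request. Note that browser JavaScript can only read a cross-origin response if the server allows it via CORS, so a server-side fetch may still be needed to obtain them.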

score 0 · Answer 1 · answered Sep 30 '15 at 06:04

You can try this code to check by URL whether a file exists on a remote server:

    $url = 'http://arxiv.org/pdf/1207.4102.pdf'; // get_headers() needs a full URL, including the scheme
    $headers = @get_headers($url);               // returns false on failure

    // Redirects are followed by default, so keep the last status line seen.
    $status = 0;
    foreach ($headers ?: [] as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }

    if ($status === 200) {
        echo "The file $url exists";
    } else {
        echo "The file $url does not exist (final status: $status)";
    }

@user3741635 For checking files on a remote server, I recommend always doing it on the server side. – Ajeet Kumar, Sep 30 '15 at 06:07

Pedro Lobito · Answer 2 · 2015-09-30T06:11:57.563

You may want to use curl and check for a 200 HTTP status code, e.g.:

<?php

$url = 'http://arxiv.org/pdf/1207.41021.pdf';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);    // we want headers
curl_setopt($ch, CURLOPT_NOBODY, true);    // we don't need body
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1); // we follow redirections
curl_setopt($ch, CURLOPT_TIMEOUT,10);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


// CURLINFO_HTTP_CODE is returned as an integer
if ($httpcode === 200) {
    echo "file exists";
} else {
    echo "file doesn't exist";
}

With this code, both PDF URLs return 403 Forbidden:

The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.

score 0 · Accepted Answer · answered Sep 30 '15 at 06:09

The following code returns the correct result:

function getHTTPCode($url) {

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, true);   
    curl_setopt($ch, CURLOPT_NOBODY, true);   
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_TIMEOUT,10);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)');
    $output = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpcode;

}

$url = 'http://arxiv.org/pdf/1207.41021.pdf';
if(getHTTPCode($url)==200) {
 echo  'found';
} else {
 echo  'not found';
}

answered Sep 30 '15 at 06:09 · Samir Das

What's the main difference between your answer and mine? Not cool to copy my answer. – Pedro Lobito Sep 30 '15 at 06:09
It seems the same except for the browser user agent. I would not have posted my answer if I had known there was already a curl solution. You posted your answer while I was testing mine. – Samir Das Sep 30 '15 at 06:13
You should've checked before posting. – Pedro Lobito Sep 30 '15 at 06:17
@PedroLobito, I don't think your answer works because it returns false for both urls. – user3741635 Sep 30 '15 at 06:18
It should, both pdfs have a 403 status code. – Pedro Lobito Sep 30 '15 at 06:19
Nope, it returns 404, which is correct. You can open both URLs in a browser and check the response code. But your code always returns 403, which is wrong. – Samir Das Sep 30 '15 at 06:20
@Samir, thats right. Does this code work for any pdf or just the ones on the arXiv. – user3741635 Sep 30 '15 at 06:21
Any http request actually :) – Samir Das Sep 30 '15 at 06:21
You've copied my answer and now you're telling me it doesn't work ? look, learn to code and don't be a faker. – Pedro Lobito Sep 30 '15 at 06:24
@PedroLobito, your code returns false for both urls when it should return true for the first. – user3741635 Sep 30 '15 at 06:26
@PedroLobito ha ha. You are the genius in the planet who can only write curl using php – Samir Das Sep 30 '15 at 06:26

score -2 · Answer 4 · edited May 23 '17 at 12:23

Using PHP, you can check whether a local file exists with file_exists(): http://php.net/manual/en/function.file-exists.php

For a remote file, check the headers returned by a request: https://stackoverflow.com/a/8139136/3222087

answered Sep 30 '15 at 05:55 · Slowmove
