I have a PDF that contains some links. A link does not appear as a plain URL such as http://www.example.com/abcd.pdf; instead, some text is linked to a URL. I just want to extract that URL.

SHIN
  • Are you able to get text from a PDF file yet? If not, have a look at this: http://stackoverflow.com/questions/1882318/search-through-pdf-files-with-php. After that, you can search the text for URLs with a regex, for example. – Pieter Jul 24 '13 at 11:16
  • I tried another PDF reader. I am getting the text, but not the link (URL) associated with the text. – SHIN Jul 24 '13 at 13:17
  • What are you using to get the links? `preg_match_all` or something? Post your code. – Pieter Jul 24 '13 at 14:45
  • @peter, I cannot post all the code here; it is too long. I am using this code: http://webcheatsheet.com/php/reading_clean_text_from_pdf.php – SHIN Jul 25 '13 at 05:30
  • I mean the code after you extract the text from the PDF. – Pieter Jul 25 '13 at 06:53
  • @peter, after reading the PDF with that script, I can't see any link (URL). That is the issue I am facing. If there were a link, we could get it with `preg_match_all` as you said. If I open the PDF in Notepad I can see the link, but when I open it with `fopen('abcd.pdf', 'r')` it returns binary (encrypted) values. After changing the extension to .txt it shows the same binary format. – SHIN Jul 25 '13 at 09:42

1 Answer

There is no need for a separate PDF-reading library, as I first assumed. We can simply read the PDF file with fopen() or file_get_contents() and search its raw contents.

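    // Read the raw PDF bytes; no include-path lookup is needed.
    $pdf_content = file_get_contents($actual_pdf_file);
    // Capture the target of each link action written as /URI(...)/S/URI.
    preg_match_all('/URI\(([^,]*?)\)\/S\/URI/', $pdf_content, $matches);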

I wrote this preg_match_all() pattern to fit my requirement. A URI entry is present for each link.
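The pattern above expects the keys in one particular order, so a slightly more tolerant variant may be useful. The following is only a sketch under assumptions: `extract_pdf_uris()` is a hypothetical helper, the sample annotation fragments are made up, and it only sees link actions stored as plain text. A PDF that packs its objects into compressed streams would need decompressing first, and escapes other than backslash-escaped characters (octal codes, `\n`) are not interpreted.

```php
<?php
// Hypothetical helper (not from the original answer): pull the targets of
// /URI actions out of raw, uncompressed PDF bytes.
function extract_pdf_uris(string $pdfBytes): array
{
    // Match "/URI (" ... ")" where the body is made of backslash-escaped
    // pairs or plain characters; key order in the dictionary does not matter.
    $pattern = '~/URI\s*\(((?:\\\\.|[^\\\\()])*)\)~s';
    if (!preg_match_all($pattern, $pdfBytes, $matches)) {
        return [];
    }

    $urls = [];
    foreach ($matches[1] as $raw) {
        // Strip the backslash from escaped characters such as "\(" and "\)".
        $urls[] = preg_replace('~\\\\(.)~s', '$1', $raw);
    }

    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($urls));
}

// Made-up annotation fragments showing two possible key orders.
$sample = "<</Type/Annot/Subtype/Link/A<</URI(http://www.example.com/abcd.pdf)/S/URI>>>>\n"
        . "<< /S /URI /URI (http://www.example.com/a\\(1\\).pdf) >>";

print_r(extract_pdf_uris($sample));
```

With a real file, the call would be `extract_pdf_uris(file_get_contents($actual_pdf_file))`, and the returned array plays the same role as `$matches[1]`.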

The URLs, if any, are now in $matches[1]. In my case each URL is a PDF download link. The code for downloading the PDF from each link is below.

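    foreach ($matches[1] as $i => $pdfurl) {
        $CurlConnect = curl_init();
        curl_setopt($CurlConnect, CURLOPT_URL, $pdfurl);
        curl_setopt($CurlConnect, CURLOPT_RETURNTRANSFER, 1);
        // Send a POST body only if $request was defined earlier; otherwise do a plain GET.
        if (isset($request)) {
            curl_setopt($CurlConnect, CURLOPT_POST, 1);
            curl_setopt($CurlConnect, CURLOPT_POSTFIELDS, $request);
        }
        $Result = curl_exec($CurlConnect);
        // Skip failed transfers instead of writing an empty file.
        if ($Result === false) {
            continue;
        }
        // Number the files so each download does not overwrite the previous one.
        $new_down_pdf = 'new_pdf_name_' . $i . '.pdf';
        file_put_contents($new_down_pdf, $Result);
    }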
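When a PDF contains several links, each download needs its own file name, or later files replace earlier ones. One option is to derive the name from the URL itself. This is a minimal sketch, assuming a hypothetical helper `local_pdf_name()` and a made-up fallback name `download.pdf`; it is not part of the original answer.

```php
<?php
// Hypothetical helper: build a local file name for the $index-th link.
function local_pdf_name(string $url, int $index): string
{
    // Take the last path segment of the URL, ignoring any query string.
    $path = (string) parse_url($url, PHP_URL_PATH);
    $name = basename(rawurldecode($path));

    // Keep only characters that are safe in file names.
    $name = preg_replace('/[^A-Za-z0-9._-]/', '_', $name);

    // Fall back to a generic name when the URL does not end in a .pdf file.
    if ($name === '' || strtolower(pathinfo($name, PATHINFO_EXTENSION)) !== 'pdf') {
        $name = 'download.pdf';
    }

    // Prefix the index so two links with the same file name cannot collide.
    return $index . '_' . $name;
}

echo local_pdf_name('http://www.example.com/abcd.pdf', 0), "\n";
echo local_pdf_name('http://www.example.com/get?id=7', 1), "\n";
```

Inside the download loop, the result could be passed to file_put_contents() in place of a fixed name.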
SHIN