I have a PDF that contains some links. The links do not appear as visible URLs such as http://www.example.com/abcd.pdf; instead, some text is linked to a URL. I just want to extract those URLs.
-
Are you able to get text from a PDF file yet? If not, have a look at this: http://stackoverflow.com/questions/1882318/search-through-pdf-files-with-php. After that, you can search the text for URLs with a regex (for example). – Pieter Jul 24 '13 at 11:16
-
I tried other PDF readers. I get the text, but not the link (URL) associated with it. – SHIN Jul 24 '13 at 13:17
-
How are you trying to get the links? `preg_match_all` or something? Post your code. – Pieter Jul 24 '13 at 14:45
-
@Pieter, I cannot post all of the code here; it is too long. I am using this code: http://webcheatsheet.com/php/reading_clean_text_from_pdf.php – SHIN Jul 25 '13 at 05:30
-
I mean the code after you extract the text from the PDF. – Pieter Jul 25 '13 at 06:53
-
@Pieter, after reading the PDF with that script, I can't see any links (URLs). That is the issue I am facing. If there were any links in the extracted text, we could get them with `preg_match_all`, as you said. If I open the PDF in Notepad I can see the link, but when I open it with `fopen('abcd.pdf', 'r')` it returns binary (encrypted) values. After changing the extension to .txt it still shows the same binary format. – SHIN Jul 25 '13 at 09:42
1 Answer
There is no need to use a separate PDF reading script, as I did initially. We can simply read the PDF file with fopen() or file_get_contents().
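// Read the raw bytes of the PDF; no PDF parser is involved.
$pdf_content = file_get_contents($actual_pdf_file);
// Capture the target of every link action written as /URI(...)/S/URI.
preg_match_all('/URI\(([^,]*?)\)\/S\/URI/', $pdf_content, $matches);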
I wrote this preg_match_all pattern to fit my requirement: in my PDF, every link is stored with a URI entry.
The URLs, if any, are now in the $matches array ($matches[1] holds the captured URLs). In my case each URL is a PDF download link. The code for downloading the PDF from each link is below.
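foreach ($matches[1] as $i => $pdfurl) {
    $CurlConnect = curl_init();
    curl_setopt($CurlConnect, CURLOPT_URL, $pdfurl);
    // Return the response body instead of printing it.
    // A plain GET request is assumed to be enough for a download link.
    curl_setopt($CurlConnect, CURLOPT_RETURNTRANSFER, 1);
    $Result = curl_exec($CurlConnect);
    if ($Result === false) {
        continue; // skip URLs that could not be downloaded
    }
    // Number the files so that later downloads do not overwrite earlier ones.
    $new_down_pdf = 'new_pdf_name_' . $i . '.pdf';
    file_put_contents($new_down_pdf, $Result);
}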
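The pattern above is tied to the exact byte layout of my file (`/URI(...)/S/URI` with no spaces). Other PDF producers may write the same entry as `/S /URI /URI (...)`, so here is a slightly more tolerant sketch. The function name `extract_pdf_uris` and the sample string are made up for illustration. It assumes the link dictionaries are stored uncompressed in the file; if they sit inside compressed object streams, a raw-byte search like this will not see them.

```php
<?php
// Hypothetical helper: collect the targets of /URI link actions
// from the raw bytes of a PDF.
function extract_pdf_uris(string $pdfBytes): array
{
    // "/URI" followed by optional whitespace and a parenthesised literal string.
    // The "/URI" in "/S /URI" is skipped because no "(" follows it.
    // Inside the string, a backslash escapes the next character.
    $pattern = '~/URI\s*\(((?:\\\\.|[^\\\\()])*)\)~s';
    preg_match_all($pattern, $pdfBytes, $matches);

    $uris = [];
    foreach ($matches[1] as $raw) {
        // Undo the escaping of "(", ")" and "\" used in PDF literal strings.
        $uris[] = preg_replace('~\\\\([()\\\\])~', '$1', $raw);
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($uris));
}

// Made-up sample covering both key orders and an escaped parenthesis.
$sample = "<</Type/Annot/Subtype/Link/A<</S/URI/URI(http://www.example.com/abcd.pdf)>>>>\n"
        . "<</A<</URI (http://www.example.com/a\\(1\\).pdf) /S /URI>>>>";

print_r(extract_pdf_uris($sample));
```

With a real file, pass `file_get_contents($actual_pdf_file)` instead of `$sample` and loop over the returned array in place of `$matches[1]`.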

SHIN