Read pdf files with php

Question

I have a large PDF file that is a floor map for a building. It has layers for all the office furniture including text boxes of seat location.

My goal is to read this file with PHP, search the document for text layers, get their contents and coordinates in the file. This way I can map out seat locations -> x/y coordinates.

Is there any way to do this via PHP? (Or even Ruby or Python if that's what's necessary)

Does the markup code contain "coordinates"? If not, you can search as long as you want. PHP cannot locate pixel coordinates in a PDF file. Try to explain your "problem" in a bit more detail, maybe by using a picture, etc. – Julius F, Oct 21 '09 at 19:18
Hello, did you find an answer to your question? I'm stuck on a similar problem and can't find a solution. If you found one, could you please tell me how you got the coordinates of images from the PDF file? – Pigalev Pavel, Jan 17 '13 at 10:07

score 44 · Accepted Answer · edited Sep 28 '20 at 10:23

Check out FPDF (with FPDI):

http://www.fpdf.org/

http://www.setasign.de/products/pdf-php-solutions/fpdi/

These will let you open a PDF and add content to it in PHP. I'm guessing you can also use their functionality to search through the existing content for the values you need.

Another possible library is TCPDF: https://tcpdf.org/

Update to add a more modern library: PDF Parser (https://github.com/smalot/pdfparser)

edited Sep 28 '20 at 10:23 by Martin · answered Oct 17 '09 at 17:49 by Jay

As far as parsing the PDF into PHP goes, FPDF falls short, while pdfparser (http://www.pdfparser.org/documentation) has a clean and intuitive programming interface. – Nate Jun 24 '15 at 02:17
hi @Nate! I added the pdf parser library to the answer. Thanks for the downvote on a 6 year old answer! – Jay Jun 24 '15 at 06:35
That's why "primarily opinion based" questions are out of bounds on here in the first place. Also, I don't think there's anything bad about expressing an opinion on a 6 year old question, but I agree in this case the down vote is silly. So I upvoted you :) – David van Driessche Jun 24 '15 at 08:43
While searching for my own answers, I came across this information and at the time I wasn't looking for the information's age. This site is a good resource but only if the information is true. – Nate Jun 24 '15 at 11:33
The fpdf FAQ states, "18. I'd like to make a search engine in PHP and index PDF files. Can I do it with FPDF? No." While the OP isn't looking for a search engine, this Q & A demonstrates fpdf's inability to parse textual elements from a pdf, which is what the OP and myself are looking for. Your provided solution isn't a solution to the original question and now, it seems, the ignorance is spreading. It is vital that the information on this site remain accurate otherwise it is another "yahoo answers". – Nate Jun 24 '15 at 11:59
@Nate, I hear ya, and that's why I added it to the answer, so anyone looking will see it. I think for really old questions, adding comments or answers to keep the info up to date is a great way to keep the site relevant and accurate. – Jay Jun 24 '15 at 16:55
Indeed using PDF Parser (https://github.com/smalot/pdfparser) is still the quickest and easiest solution – Jouke Aug 09 '23 at 12:03
@Jouke, can PDF Parser also be used to get the x and y coordinates of a tag string (like # etc.)? – faradie Aug 10 '23 at 05:22

score 31 · Answer 2 · edited Jul 10 '15 at 14:30

There is a PHP library (pdfparser) that does exactly what you want.

Project website: http://www.pdfparser.org/

GitHub: https://github.com/smalot/pdfparser

Demo page/API: http://www.pdfparser.org/demo

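After including pdfparser in your project, you can get all the text from mypdf.pdf like so:

<?php
// Load Composer's autoloader (assumes pdfparser was installed with Composer).
require 'vendor/autoload.php';

// The library's classes live in the Smalot\PdfParser namespace.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('mypdf.pdf');

// Concatenated text of every page in mypdf.pdf.
$text = $pdf->getText();
echo $text;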

Similarly, you can get the metadata from the PDF as well as the PDF objects (for example, images).
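For the coordinates asked about in the question, a rough sketch follows. It assumes a pdfparser release that provides `Page::getDataTm()`, and that each entry it returns is shaped like `[[a, b, c, d, x, y], 'text']`, where the last two matrix values are the text position in PDF user-space units (usually measured from the bottom-left corner of the page). The helper names `textPositions` and `pageTextPositions` are made up for this illustration, and the positions may still need adjusting if the page applies additional transformations.

```php
<?php
// Sketch only: assumes smalot/pdfparser is installed via Composer and that
// Page::getDataTm() returns entries shaped like [[a, b, c, d, x, y], 'text'].

/**
 * Hypothetical helper: turn getDataTm()-style entries into text/x/y rows.
 */
function textPositions(array $dataTm): array
{
    $rows = [];
    foreach ($dataTm as $entry) {
        // Index 0 is the text matrix, index 1 is the text itself.
        [$matrix, $text] = $entry;
        $label = trim($text);
        if ($label === '') {
            continue; // skip whitespace-only fragments
        }
        $rows[] = [
            'text' => $label,
            'x'    => (float) $matrix[4], // horizontal offset of the text
            'y'    => (float) $matrix[5], // vertical offset of the text
        ];
    }
    return $rows;
}

/**
 * Hypothetical helper: list the text positions found on one page of a PDF.
 */
function pageTextPositions(string $pdfPath, int $pageIndex = 0): array
{
    require_once 'vendor/autoload.php';

    $parser = new \Smalot\PdfParser\Parser();
    $pdf    = $parser->parseFile($pdfPath);
    $pages  = $pdf->getPages();

    return textPositions($pages[$pageIndex]->getDataTm());
}
```

Calling `pageTextPositions('mypdf.pdf')` would then give one row per text fragment on the first page, which can be filtered down to the seat labels.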

edited Jul 10 '15 at 14:30 · answered Jan 23 '14 at 10:42 by kasper Taeymans

I've tried this library. It fails to parse many PDF files; otherwise it works. – Shakeel Ahmed Jan 15 '19 at 09:56

score 5 · Answer 3 · edited Nov 01 '19 at 23:59

Not exactly PHP, but you could exec a program from PHP to convert the PDF to a temporary HTML file and then parse the resulting file with PHP. I've done something similar for a project of mine and this is the program I used:

PdfToHtml

The resulting HTML wraps text elements in <div> tags with absolute position coordinates. It seems like this is exactly what you are trying to do.
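A hedged sketch of the parsing step, for anyone looking for sample code. It swaps the HTML output for the XML output that Poppler's pdftohtml writes when run with the `-xml` flag (for example `pdftohtml -xml floorplan.pdf floorplan`, run from PHP via `exec()` with `escapeshellarg()` on the paths), because the XML is easier to read reliably than positioned HTML. The assumptions here are that each `<text>` element carries `top` and `left` attributes measured from the top-left corner of its `<page>`, and that the file names and the function name `parsePdfToHtmlXml` are placeholders of my own.

```php
<?php
// Sketch: read text boxes from the XML written by `pdftohtml -xml in.pdf out`.
// Assumes each <text> element has top/left attributes relative to its <page>.

function parsePdfToHtmlXml(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException('Could not parse pdftohtml XML output');
    }

    $boxes = [];
    foreach ($doc->page as $page) {
        foreach ($page->text as $text) {
            // textContent also picks up text nested in <b> or <i> children.
            $label = trim(dom_import_simplexml($text)->textContent);
            if ($label === '') {
                continue; // ignore empty text boxes
            }
            $boxes[] = [
                'page' => (int) $page['number'],
                'text' => $label,
                'left' => (int) $text['left'],
                'top'  => (int) $text['top'],
            ];
        }
    }
    return $boxes;
}

// Hand-written sample in the assumed shape; real input would come from
// file_get_contents() on the .xml file that pdftohtml produced.
$sample = <<<XML
<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="200" width="50" height="12" font="0">Seat <b>12A</b></text>
    <text top="140" left="200" width="50" height="12" font="0">Seat 12B</text>
  </page>
</pdf2xml>
XML;

foreach (parsePdfToHtmlXml($sample) as $box) {
    printf("%s: page %d, left %d, top %d\n", $box['text'], $box['page'], $box['left'], $box['top']);
}
```

Note that these coordinates are in the converter's own units with the origin at the top-left of the page, so they would need scaling to match whatever image of the floor map you overlay them on.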

edited Nov 01 '19 at 23:59 · answered Jun 17 '09 at 00:39 by Rado

Hey, can you post some sample code showing how you achieved your results? I couldn't find proper documentation. It would be great. – Tarik Jul 05 '11 at 09:25

score 3 · Answer 4 · answered Apr 09 '18 at 15:19

Your initial request is: "I have a large PDF file that is a floor map for a building."

I'm afraid this might be harder than you expect.

The library most people currently use to parse PDFs is smalot/pdfparser, and it is known to run into problems with large files.

I am also looking for a solid PHP library that parses PDFs without memory spikes that force you to disable PHP's memory limit, as a lot of "developers" do (which I think is really not advisable).

See this issue for more details about smalot's performance: https://github.com/smalot/pdfparser/issues/163

score 0 · Answer 5 · answered Oct 11 '13 at 08:58

You might want to also try this application http://pdfbox.apache.org/. A working example can be found at https://www.jinises.com

answered Oct 11 '13 at 08:58 by Mike

Sorry, but this is Java and not PHP :-/ – Michael Walter Sep 05 '15 at 13:10
