Extracting specific data via coordinates using PHP PdfParser

Question

I want to extract specific data from various PDFs that are 3-4 pages each. I don't want to parse everything (all the text of each page) and then use, for example, regular expressions to match the data that I want.

So I was looking at the documentation, and the PHP PdfParser has this function, $data = $pdf->getPages()[0]->getDataTm();, which returns an array. The documentation says: "You can extract transformation matrix (indexes 0-3) and x,y position of text objects (indexes 4,5)." (https://github.com/smalot/pdfparser/blob/master/doc/Usage.md)

So I tried it, and it returns an array with all the data that I want, plus the coordinates of each item.

Here is an example if you want to try it.

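<?php
// Load Composer's autoloader so the smalot/pdfparser classes are available.
require_once __DIR__ . '/vendor/autoload.php';
use Smalot\PdfParser\Parser;

// Parse the PDF file into a document object.
$parser = new Parser();
$pdf = $parser->parseFile('pdfFile.pdf');

// Get the text objects of the first page together with their positions.
$data = $pdf->getPages()[0]->getDataTm();
print_r($data);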
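
For reference, going by the Usage.md description quoted above, each entry of that array pairs a six-number matrix with the text it positions. Here is a minimal sketch of reading one entry, using a hand-written sample array in place of real getDataTm() output. The sample values and the describeEntry() helper are made up for illustration and are not part of the library.

```php
<?php
// Hypothetical sample shaped like getDataTm() output:
// each entry is [ [a, b, c, d, x, y], text ].
$sample = [
    [['1', '0', '0', '1', '201.96', '720.68'], 'Document title'],
    [['1', '0', '0', '1', '260', '120'], 'Invoice total'],
];

// Hypothetical helper: format one entry as "(x, y) text".
function describeEntry(array $entry): string
{
    [$tm, $text] = $entry;   // $tm holds the six matrix values
    $x = (float) $tm[4];     // index 4: x position
    $y = (float) $tm[5];     // index 5: y position
    return sprintf('(%.2f, %.2f) %s', $x, $y, $text);
}

foreach ($sample as $entry) {
    echo describeEntry($entry), PHP_EOL;
}
```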

Now let's say I have the coordinates, but I don't know how to use them to find the exact data that I want. I was looking in the documentation for a function that takes the coordinates, something like functionXYcoordinates("260", "120"), to get exactly what I want from my PDF, but I couldn't find anything.
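
One way such a lookup could work is to filter the getDataTm() array by hand, keeping the entries whose x,y position falls within a small tolerance of the target. The sketch below is only an illustration under assumptions: findTextAt() and its $tolerance parameter are hypothetical names, not library functions, and the sample array stands in for real output. Some versions of the library's Usage.md also appear to describe a Page::getTextXY($x, $y, $xError, $yError) method for this purpose, so it may be worth checking whether the installed version has it.

```php
<?php
/**
 * Hypothetical helper: return the text of every entry whose position lies
 * within $tolerance units of ($x, $y) on both axes.
 * $dataTm is assumed to be shaped like getDataTm() output:
 * [ [ [a, b, c, d, x, y], text ], ... ].
 */
function findTextAt(array $dataTm, float $x, float $y, float $tolerance = 1.0): array
{
    $matches = [];
    foreach ($dataTm as $entry) {
        [$tm, $text] = $entry;
        // Indexes 4 and 5 of the matrix hold the x and y position.
        if (abs((float) $tm[4] - $x) <= $tolerance
            && abs((float) $tm[5] - $y) <= $tolerance) {
            $matches[] = $text;
        }
    }
    return $matches;
}

// Made-up sample; in practice pass $pdf->getPages()[0]->getDataTm().
$sample = [
    [['1', '0', '0', '1', '201.96', '720.68'], 'Document title'],
    [['1', '0', '0', '1', '260.3', '119.8'], 'Total: 42.00'],
];

print_r(findTextAt($sample, 260, 120));
```

The tolerance is there because the same field rarely sits at exactly the same position in every PDF; a value that is too large will start matching neighbouring text objects.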

If anyone knows whether there is a function like this in PdfParser, please let me know. Also feel free to say so if you believe that extracting data via coordinates is a bad idea, and that it is better to parse all the pages and then use regular expressions to match the specific data.
