I want to extract specific data from several PDFs that are 3-4 pages each. I don't want to parse all the text of every page and then use, for example, regular expressions to match the data I want.
So I was looking at the documentation, and the PHP PdfParser library has this function: $data = $pdf->getPages()[0]->getDataTm();
It returns an array, and the documentation says: "You can extract transformation matrix (indexes 0-3) and x,y position of text objects (indexes 4,5)."
(https://github.com/smalot/pdfparser/blob/master/doc/Usage.md)
So I tried it, and it returns an array with all the data I want, plus the coordinates of each item.
Here is an example for you to try if you want:
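<?php
// Load Composer's autoloader so the PdfParser classes are available.
require_once __DIR__ . '/vendor/autoload.php';
use Smalot\PdfParser\Parser;
$parser = new Parser();
$pdf = $parser->parseFile('pdfFile.pdf');
// Text objects of the first page, each with its transformation matrix and x,y position.
$data = $pdf->getPages()[0]->getDataTm();
print_r($data);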
Now let's say I have the coordinates, but I don't know how to use them to find the exact data I want.
I was looking through the documentation for a function that takes the coordinates, something like functionXYcoordinates("260", "120"),
so that I could get exactly what I want from my PDF, but I couldn't find anything.
If anyone knows whether PdfParser has a function like this, please let me know. Also feel free to say so if you believe extracting data via coordinates is a bad idea, and that it is better to parse all the pages and then use regular expressions to match the specific data.
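
To make clear what I'm after, here is a minimal sketch of the kind of lookup I have in mind, done by hand on the array. It assumes each entry returned by getDataTm() has the shape [[a, b, c, d, x, y], text], which is how I read the documentation. findTextNear and its tolerance parameter are names I made up for illustration, not part of PdfParser, and the sample values are invented so the snippet runs without a PDF.

```php
<?php
// Invented sample data in the assumed getDataTm() shape:
// each entry is [[a, b, c, d, x, y], text].
$data = [
    [[1, 0, 0, 1, 201.96, 720.68], 'Document title'],
    [[1, 0, 0, 1, 260.0, 120.0], 'Invoice total'],
    [[1, 0, 0, 1, 70.0, 120.4], 'Customer name'],
];

/**
 * Hypothetical helper (not a PdfParser function): return the text of every
 * entry whose x,y position lies within $tolerance of the requested point.
 */
function findTextNear(array $data, float $x, float $y, float $tolerance = 1.0): array
{
    $matches = [];
    foreach ($data as $entry) {
        [$tm, $text] = $entry;
        // Indexes 4 and 5 hold the x and y position of the text object.
        $dx = abs((float) $tm[4] - $x);
        $dy = abs((float) $tm[5] - $y);
        if ($dx <= $tolerance && $dy <= $tolerance) {
            $matches[] = $text;
        }
    }
    return $matches;
}

print_r(findTextNear($data, 260, 120));
```

With a real file, $data would come from $pdf->getPages()[0]->getDataTm() instead of the hard-coded array. The tolerance is there because I expect the coordinates to shift slightly from one PDF to the next, but that is a guess on my part.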