I want to extract specific data from various PDFs that are 3-4 pages each. I don't want to parse everything (all the text of each page) and then use, for example, regular expressions to match the data that I want.

So I was looking through the documentation, and the PHP pdfParser has this function: $data = $pdf->getPages()[0]->getDataTm(); It returns an array, and the documentation says: "You can extract transformation matrix (indexes 0-3) and x,y position of text objects (indexes 4,5)." (https://github.com/smalot/pdfparser/blob/master/doc/Usage.md)

So I tried it, and it returns an array with all the data that I want, plus the coordinates of each item.

Here is an example for you to try if you want.

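<?php
require_once __DIR__ . '/vendor/autoload.php';

use Smalot\PdfParser\Parser;

// Parse the PDF file
$parser = new Parser();
$pdf = $parser->parseFile('pdfFile.pdf');

// Text objects of the first page, each with its transformation matrix and x,y position
$data = $pdf->getPages()[0]->getDataTm();
print_r($data);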

Now let's say I have the coordinates, but I don't know how to use them to find the exact data that I want. I was looking through the documentation for a function that takes coordinates, something like functionXYcoordinates("260", "120"), to get exactly what I want from my PDF, but I couldn't find anything.
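
To make it concrete, here is a rough sketch of the kind of helper I mean, written by hand against the array that getDataTm() returns. The name findTextNear and the $tolerance parameter are made up for illustration and are not part of pdfParser. I am also assuming each entry has the shape [ [a, b, c, d, x, y], 'text' ], based on the quote from the documentation above. The sample data is invented so that the snippet runs without a PDF.

```php
<?php
// Hypothetical helper, NOT part of smalot/pdfparser.
// Assumption: $dataTm has the shape returned by getDataTm(), i.e. each
// entry is [ [a, b, c, d, x, y], 'text' ].
function findTextNear(array $dataTm, float $x, float $y, float $tolerance = 1.0): array
{
    $matches = [];
    foreach ($dataTm as $entry) {
        [$tm, $text] = $entry;
        // Indexes 4 and 5 of the transformation matrix hold the x,y position.
        $dx = abs((float) $tm[4] - $x);
        $dy = abs((float) $tm[5] - $y);
        if ($dx <= $tolerance && $dy <= $tolerance) {
            $matches[] = $text;
        }
    }
    return $matches;
}

// Made-up sample data in the assumed shape (not taken from a real PDF).
$sample = [
    [[1, 0, 0, 1, 260.0, 120.0], 'Invoice total'],
    [[1, 0, 0, 1, 260.4, 119.8], '42.00'],
    [[1, 0, 0, 1, 72.0, 700.0], 'Page header'],
];

print_r(findTextNear($sample, 260, 120));
```

With a real file, $sample would be replaced by $pdf->getPages()[0]->getDataTm(). The tolerance is there because I would not expect the coordinates to match exactly from one PDF to the next.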

If anyone knows whether there is a function like this in pdfParser, please let me know. Also feel free to say so if you believe that extracting data via coordinates is a bad idea and that it is better to parse all the pages and then use regular expressions to match the specific data.

ThunderBoy