I have a PDF containing several geometric objects (mostly lines) in different sizes and colors. I want to extract them in the following form, e.g. for lines:

  • (startx, starty)
  • (endx, endy)
  • width
  • color

Optionally, I would also like a "z" position that determines which object is drawn first. My language of choice is C++, and I thought about using PoDoFo, or rather pdfmm, as it should be more accessible. However, I am totally lost as to how to access this information...
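To make the target concrete, here is a rough sketch of the record I have in mind, together with a hand-written scan over an already decoded content stream. It does not use PoDoFo or pdfmm at all. The names `LineSegment` and `extractLines` are made up for illustration, and the sketch assumes whitespace-separated tokens and handles only the operators `m`, `l`, `w`, `RG`, `S` and `n`. Transformations (`cm`), the `q`/`Q` state stack, rectangles, curves, other color spaces, strings and comments are all ignored.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one stroked line segment (illustrative names).
struct LineSegment {
    double startX, startY;
    double endX, endY;
    double width;                 // value of the "w" operator at stroke time
    std::array<double, 3> color;  // stroke RGB from "RG", each in [0, 1]
    std::size_t z;                // paint order: 0 is drawn first
};

// Returns true if the whole token parses as a number.
static bool isNumber(const std::string& tok, double& value) {
    try {
        std::size_t used = 0;
        value = std::stod(tok, &used);
        return used == tok.size();
    } catch (...) {
        return false;
    }
}

// Assumption: `content` is a decoded (uncompressed) page content stream
// whose tokens are separated by whitespace.
std::vector<LineSegment> extractLines(const std::string& content) {
    std::vector<LineSegment> result;   // segments that were actually stroked
    std::vector<LineSegment> pending;  // segments of the path being built
    std::vector<double> operands;      // numbers seen since the last operator
    double curX = 0.0, curY = 0.0;
    double width = 1.0;                          // PDF default line width
    std::array<double, 3> color{0.0, 0.0, 0.0};  // PDF default stroke color

    std::istringstream in(content);
    std::string tok;
    while (in >> tok) {
        double v = 0.0;
        if (isNumber(tok, v)) {
            operands.push_back(v);
            continue;
        }
        const std::size_t n = operands.size();
        if (tok == "w" && n >= 1) {
            width = operands[n - 1];
        } else if (tok == "RG" && n >= 3) {
            color = {operands[n - 3], operands[n - 2], operands[n - 1]};
        } else if (tok == "m" && n >= 2) {
            curX = operands[n - 2];
            curY = operands[n - 1];
        } else if (tok == "l" && n >= 2) {
            pending.push_back(LineSegment{curX, curY, operands[n - 2],
                                          operands[n - 1], width, color, 0});
            curX = operands[n - 2];
            curY = operands[n - 1];
        } else if (tok == "S") {
            // Width and color are taken from the state in effect when stroking.
            for (LineSegment& seg : pending) {
                seg.width = width;
                seg.color = color;
                seg.z = result.size();
                result.push_back(seg);
            }
            pending.clear();
        } else if (tok == "n") {
            pending.clear();  // path ended without being painted
        }
        operands.clear();  // every operator consumes its operands
    }
    return result;
}
```

Calling `extractLines("1 0 0 RG 10 w 0 0 m 595 842 l S")` would yield one red segment of width 10 from (0, 0) to (595, 842) with `z == 0`. What I am missing is the library side: how to get the decoded content stream, or its tokens, out of a page.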

I found the following reference: PDF parsing in C++ (PoDoFo)

However, I was not able to make the PdfTokenizer work. Tokenizer.TryReadNextToken needs an InputStreamDevice object, and I do not know how to get one.

For example, I create a single page with just one line in pdfmm, and now I want to extract this information:

#include <pdfmm/pdfmm.h>

using namespace mm;  // pdfmm types live in namespace mm

int main()
{
    try {
        PdfMemDocument document;

        document.Load("test.pdf");
        PdfPage* page = document.GetPages().CreatePage(PdfPage::CreateStandardPageSize(PdfPageSize::A4));

        // Draw a single diagonal line across the page
        PdfPainter painter;
        painter.SetCanvas(page);

        painter.GetGraphicsState().SetLineWidth(10);
        painter.DrawLine(0, 0, page->GetRect().GetWidth(), page->GetRect().GetHeight());
        painter.FinishDrawing();

        // Loop over all tokens of the page
        PdfTokenizer token(true);
        char* stoken = nullptr;
        PdfVariant var;
        PdfContentType type;

        // ???? is the InputStreamDevice I do not know how to obtain
        while (token.TryReadNextToken( ????  ,stoken,type)) {
        }
    }
    catch (PdfError& err)
    {
        err.PrintErrorMsg();
        return (int)err.GetError();
    }

    return 0;
}

If anybody could push me in the right direction, that would be awesome! And if somebody knows of good documentation on the structure of a PDF and/or a good tutorial for pdfmm / PoDoFo, that would also be highly appreciated...

General Grievance
Peter
  • Okay, this looks more challenging than I first thought... – Peter Sep 17 '22 at 07:47
  • Maybe I should give some more insight into my project: I have some blueprints and I am writing a small app that allows placing symbols inside these plans. The first version of this program works with OpenGL, where, for now, the blueprints are loaded as JPGs. However, when scaling and zooming into a blueprint I get some blurring. So my thought was to extract all the lines and draw them in OpenGL directly. But maybe I should generate high-resolution JPGs of the blueprints, load those, and use pdfmm for writing only. Still, I would like to know if it is possible. – Peter Sep 17 '22 at 07:53
  • Main contributor of pdfmm here: maybe you should have a look at `PdfContentsReader`, which is a higher-level, much more advanced API for accessing the contents of a page. Here is an example of [use](https://github.com/pdfmm/pdfmm/blob/40ceded4bbb337cb10ad1b14d4d5f3929522be81/src/pdfmm/base/PdfPage_TextExtraction.cpp#L207). – ceztko Dec 03 '22 at 11:23