I have been trying for a while to use the PoDoFo C++ library to extract text and lines (with their respective coordinates) from a PDF, but I cannot find a way to do it.

This is what I have so far:

#include <iostream>
#include <podofo/podofo.h>
using namespace PoDoFo;
using namespace std;

int main( int argc, char* argv[] )
{
    const char* filename = "hello.pdf";
    PdfVecObjects objects;   // stack object, so nothing is leaked
    PdfParser parser(&objects);
    // Parse exactly once: the constructor that takes a filename already
    // parses, so also calling ParseFile() would read the file twice.
    parser.ParseFile(filename);

    for (TIVecObjects it = objects.begin(); it != objects.end(); ++it){
        // Dereference the iterator instead of calling RemoveObject(),
        // which modifies the container while it is being iterated.
        PdfObject * a = *it;
        // THIS IS MY PROBLEM VVVVVVVVVV
        cout << a->Reference().ToString() << endl;
    }

    return 0;
}

However, this only gives me very basic information (it seems to be just each object's number and generation number):

DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
1 0 R
2 0 R
3 0 R
4 0 R
5 0 R
6 0 R
7 0 R
8 0 R
9 0 R
10 0 R
11 0 R

I want to print out the coordinates of each object and whether it is a line or text. If it is text, I would also like to print the text itself. Does anyone who knows this library better than I do know how I could do this?

Qamar Suleiman
Dara Java

2 Answers

This answer will show you how to extract the text.

To get text positioning information, you will also have to process the following commands:

Tc, Tw, Tz, TL, T*, Tr and Tm.

You definitely need to download the PDF spec from Adobe to get all the details. There is a chapter devoted entirely to text processing. It is well worth your time to print out that chapter as you will be referring to it a lot. Everything you need to know is in there, but it's not always obvious.

You will also need to use a bit of Linear Algebra. Nothing too complicated, though.

Since there are many ways to achieve the same results, it is important to implement all the commands thoroughly, even if the documents you are going to process might not seem to need certain features. For example: I ran across a document which set all text sizes to one point, which threw off all my calculations until I realized it was using the text scaling factor to set the actual font sizes.

Ferruccio
  • I know this post is old, but I am interested in the solution, how to get the text position ? @Dara Javaherian – simon Aug 30 '16 at 14:43
  • Haha no man, sorry. My honest advice would be to give up - it's a really messy thing. You're better off even using OCR to do what you need. – Dara Java Sep 03 '16 at 20:29

Use the PoDoFo tool "podofotxtextract" (in the tools folder of the PoDoFo package). It extracts text from a PDF and gives you the x,y coordinates.

Elligno