I am trying to extract raw text from a PDF file. I have found the PoDoFo library, which seems to be able to do this job.

Based on this answer, here is what I have done so far:

#include <iostream>
#include <string>
#include <podofo/podofo.h>

//using namespace PoDoFo;

int main( int argc, char* argv[] )
{
    PoDoFo::PdfMemDocument pdf("inputpdftest.pdf");
    for (int pn = 0; pn < pdf.GetPageCount(); ++pn) 
    {
        std::cout << "Page: " << pn << std::endl;
        PoDoFo::PdfPage* page = pdf.GetPage(pn);
        PoDoFo::PdfContentsTokenizer tok(page);
        const char* token = NULL;
        PoDoFo::PdfVariant var;
        PoDoFo::EPdfContentsType type;
        while (tok.ReadNext(type, token, var)) 
        {
            if (type == PoDoFo::ePdfContentsType_Keyword)
            {
                // process type, token & var
                if (var.IsArray()) 
                {
                    PoDoFo::PdfArray& a = var.GetArray();
                    for (size_t i = 0; i < a.GetSize(); i++)
                    {
                        if (a[i].IsString())
                        {
                            std::string str = a[i].GetString().GetStringUtf8();
                            std::cout << str << " ";
                        }
                    }
                }
            }
        }
    }
    return 0;
}

The output is exactly what I get when opening the PDF in Notepad, just garbage like:

  ( : ˝  ˝   - H  -   ( : ˝ ˇ  ; 7  < ˝ ˙ ˝  )     ˆ + 0  ( : ˝     % ˆ % ˘ ˚ : ˇ  ( 7  < ˝ ˙ ˝  )       ( -  ˝   % ' ˝ ) - 0 ˝      ˜ % / ˚ (  ˙ ˚ : ˇ  ( 7  < ˝ ˙ ˝  )       ( -  ˝   % ' ˝ ) - 0 ˝    ˜ % / ˚ (  ˙ ˚ : ˇ  ˆ 7  < ˝ ˙ ˝  )    

That is expected, because I have not managed to convert this information to normal text. What I am asking is how to do that.

So, as you can see, I have to process the PDF data using the GetString function. Right now I go through each token, check whether it is an array (and contains PDF commands like TJ etc.), and then call GetString on each such element. The answer I mentioned says nothing about how to handle this further.

The documentation says it "Returns the strings contents". Is that an array that I should iterate over?

The input PDF is NOT a scanned picture or image. The file will always contain some text that can be highlighted and copied manually, or searched for a word.

Example PDF

I would sincerely appreciate an answer on how I can get text from such data.

Drakonno
  • Please provide a link to your test PDF, as it may help in determining whether it is a problem with it, or with your code or PoDoFo instead. – Jongware Aug 10 '15 at 06:37
  • Added, as you asked. You are right that I should have added it earlier. Well, the code is not good, because I just convert `PdfString` to `Utf8` and that's the result. `a[i]` is a PoDoFo object, and `GetString` only extracts the PDF data from it. I thought the UTF8 conversion would do the job, but unfortunately not. – Drakonno Aug 10 '15 at 08:42
  • Thanks! Well, the Good News Everybody is the PDF indeed contains extractable text. The first lines are "This printout has been approved by me, the author." That is good news, but it only means that either PoDoFo is doing something wrong, or ... you are. Looking for help I found an earlier SO answer to your question, and so I'm going to recommend closing yours as a duplicate. – Jongware Aug 10 '15 at 08:55
  • I already did what that post describes, assuming I was doing something wrong (I even put a link to that partial answer in my question). So, as you can see, I have to process the PDF data using the `GetString` function. Right now I go through each token, check whether it is an array (and contains PDF commands like `TJ` etc.), and then call `GetString` on each such element. Nothing is said there about how to handle this further. [The documentation](http://podofo.sourceforge.net/doc/html/classPoDoFo_1_1PdfString.html#a8abd047ca653440a81286964b4c97562) says `Returns the strings contents`. Is that an array that I should iterate over? *Edited question – Drakonno Aug 10 '15 at 09:08
  • Apologies, you are correct: that answer points you in the right direction but it doesn't answer *your* question. I'm going to have to leave this to more experienced PoDoFo users. – Jongware Aug 10 '15 at 09:12

1 Answer

The problem is that the comment

// process type, token & var

was intended to be replaced with code that actually does some processing. The code inside the if (var.IsArray()) test should only be executed once you have determined that the current command is TJ. You still need to process a number of other text commands.
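To illustrate the idea without depending on PoDoFo itself, here is a minimal sketch of the dispatch logic. It assumes the content stream arrives as operands followed by the keyword that consumes them, so operands are collected until a keyword shows up. The types Kind, Item, Operand and Token, the helper functions, and ExtractShownStrings are all hypothetical stand-ins for what the tokenizer hands you; none of them are PoDoFo API.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a content stream token (NOT PoDoFo types).
enum class Kind { String, Array, Other };

struct Item {                 // one element of a TJ array
    bool isString = false;    // false means a kerning number
    std::string text;
};

struct Operand {
    Kind kind = Kind::Other;
    std::string text;         // used when kind == Kind::String
    std::vector<Item> items;  // used when kind == Kind::Array
};

struct Token {
    bool isKeyword = false;
    std::string keyword;      // used when isKeyword
    Operand operand;          // used when !isKeyword
};

// Small helpers for building a token stream by hand.
Item Str(const std::string& s) { Item i; i.isString = true; i.text = s; return i; }
Item Kern(double /*amount*/) { return Item(); }
Token Keyword(const std::string& k) { Token t; t.isKeyword = true; t.keyword = k; return t; }
Token OtherOperand() { return Token(); }
Token StringOperand(const std::string& s) {
    Token t; t.operand.kind = Kind::String; t.operand.text = s; return t;
}
Token ArrayOperand(const std::vector<Item>& items) {
    Token t; t.operand.kind = Kind::Array; t.operand.items = items; return t;
}

// Collect operands until a keyword arrives, then act only on the
// text-showing operators: TJ (array), Tj, ' and " (single string).
std::string ExtractShownStrings(const std::vector<Token>& tokens) {
    std::vector<Operand> operands;  // operands seen since the last keyword
    std::string out;
    for (const Token& t : tokens) {
        if (!t.isKeyword) {
            operands.push_back(t.operand);
            continue;
        }
        if (!operands.empty()) {
            const Operand& last = operands.back();
            if (t.keyword == "TJ" && last.kind == Kind::Array) {
                for (const Item& it : last.items)
                    if (it.isString) out += it.text;  // skip kerning numbers
            } else if ((t.keyword == "Tj" || t.keyword == "'" || t.keyword == "\"")
                       && last.kind == Kind::String) {
                out += last.text;
            }
        }
        operands.clear();  // every keyword consumes its operands
    }
    return out;
}
```

Feeding it the equivalent of `[(Hel) -20 (lo)] TJ` followed by `( world) Tj` yields the two strings concatenated, while arrays that belong to other operators are ignored. Note that even with correct dispatch, the strings you get are character codes in the current font's encoding, so they may still need to be mapped to Unicode before they look like readable text.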

For a better example, look at the source of the podofotxtextract tool in the PoDoFo source: https://svn.code.sf.net/p/podofo/code/podofo/trunk/tools/podofotxtextract

Ferruccio