I have been asked to read PDF documents and do some processing on the extracted text. I'm using iTextSharp to read the PDF. The problem is that PdfTextExtractor.GetTextFromPage doesn't give me all the contents of the page. For example:

(Screenshot of the PDF page; image not available)

In the PDF above, I am unable to read the text highlighted in blue; I can read the rest of the characters. Below is the code that does the extraction:

    string filePath = "myFile path";
    PdfReader pdfReader = new PdfReader(filePath);
    // Only the first page is read here
    for (int page = 1; page <= 1; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    }

Any suggestions here?

I have gone through lots of questions and solutions on SO, but none of them is specific to this problem.

San
  • *"Any suggestions here?"* - Yes: Share the PDF so we can analyse it. There actually are a number of possible reasons. – mkl Apr 27 '20 at 08:37
  • https://drive.google.com/file/d/1_VsjExbtB0BlW0l19dNmBJgJ7DKg4wyS/view?usp=sharing can you check if the "From" "To" "SUBJ" are readable – San Apr 30 '20 at 04:38

1 Answer

The reason for text extraction not extracting those texts is pretty simple: Those texts are not part of the static page content but form fields! But "Text extraction" in iText (and other PDF libraries I know, too) is considered to mean "extraction of the text of the static page content". Thus, those texts you miss simply are not subject to text extraction.
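
One way to check whether the missing text really lives in form fields is to list the form field values directly. The following is a minimal sketch, assuming iTextSharp 5.x; the class name `FormFieldDump` and its `Dump` method are hypothetical names introduced only for illustration and are not part of the question's code.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helper for illustration only (assumes iTextSharp 5.x).
public static class FormFieldDump
{
    // filePath: path to the PDF to inspect (supplied by the caller).
    public static void Dump(string filePath)
    {
        PdfReader pdfReader = new PdfReader(filePath);
        try
        {
            AcroFields form = pdfReader.AcroFields;
            foreach (string name in form.Fields.Keys)
            {
                // GetField returns the current value of the named field as a string.
                Console.WriteLine("{0} = {1}", name, form.GetField(name));
            }
        }
        finally
        {
            pdfReader.Close();
        }
    }
}
```

If the texts missing from the extraction output show up in this listing, they are form field values rather than static page content.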

If you want to make form field values subject to your text extraction code, too, you first have to flatten the form field visualizations. "Flattening" here means making them part of the static page content and dropping all their form field dynamics.

You can do that by adding code after this line, which reads the PDF:

PdfReader pdfReader = new PdfReader(filePath);

The added code flattens the PDF and loads the flattened result back into pdfReader, e.g. like this:

MemoryStream memoryStream = new MemoryStream();
PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
pdfStamper.FormFlattening = true;
pdfStamper.Writer.CloseStream = false;
pdfStamper.Close();

memoryStream.Position = 0;
pdfReader = new PdfReader(memoryStream);

Extracting the text from this re-initialized pdfReader will give you the text from the form fields, too.
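
Put together with the extraction loop from the question, the flattening step might look like the sketch below. It assumes iTextSharp 5.x; the class name `FlattenAndExtract` and the method `ExtractAllText` are hypothetical, and looping over `NumberOfPages` instead of only page 1 is an assumption about what is ultimately wanted.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical wrapper for illustration only (assumes iTextSharp 5.x).
public static class FlattenAndExtract
{
    public static string ExtractAllText(string filePath)
    {
        PdfReader pdfReader = new PdfReader(filePath);

        // Flatten the form fields into an in-memory copy of the PDF.
        MemoryStream memoryStream = new MemoryStream();
        PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
        pdfStamper.FormFlattening = true;
        pdfStamper.Writer.CloseStream = false; // keep the stream open for re-reading
        pdfStamper.Close();

        // Re-open the flattened copy for text extraction.
        memoryStream.Position = 0;
        pdfReader = new PdfReader(memoryStream);

        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
        }
        pdfReader.Close();
        return text.ToString();
    }
}
```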

Unfortunately, the flattened form text is added at the end of the content stream. As your chosen text extraction strategy, SimpleTextExtractionStrategy, simply returns the text in the order it is drawn, the contents of the former form fields are all extracted at the end.

You can change this by using a different text extraction strategy, i.e. by replacing this line:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

with one of the following:

  • Using the LocationTextExtractionStrategy (which is part of the iText distribution) already returns a better result; unfortunately the form field values are not exactly on the same base line as the static contents we perceive to be on the same line, so there are some unexpected line breaks.

    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
    
  • Using the HorizontalTextExtractionStrategy (from this answer, which contains both a Java and a C# version thereof) gives an even better result. Beware, though: this strategy is not universally better; read the warnings in that answer's text.

    ITextExtractionStrategy strategy = new HorizontalTextExtractionStrategy();
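
To illustrate why the choice of strategy changes the output order, here is a toy model in plain C#. It does not use iTextSharp and is not how the library's strategies are implemented; the `Chunk` type, the coordinates, and the sample values "Alice" and "Bob" are all made up for the illustration. It only contrasts "return text in draw order" with "sort text by position on the page".

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy model for illustration only; none of these types come from iTextSharp.
public static class OrderingDemo
{
    public sealed class Chunk
    {
        public string Text;
        public double X; // horizontal position, growing to the right
        public double Y; // vertical position, growing upwards as in PDF

        public Chunk(string text, double x, double y)
        {
            Text = text;
            X = x;
            Y = y;
        }
    }

    // Static labels are drawn first; flattened field values come last,
    // mimicking a content stream with flattened form text appended at the end.
    public static List<Chunk> Sample()
    {
        return new List<Chunk>
        {
            new Chunk("From:", 50, 700),
            new Chunk("To:", 50, 680),
            new Chunk("Alice", 100, 700), // made-up field value
            new Chunk("Bob", 100, 680)    // made-up field value
        };
    }

    // Text in the order it was drawn.
    public static string InDrawOrder(IEnumerable<Chunk> chunks)
    {
        return string.Join(" ", chunks.Select(c => c.Text));
    }

    // Text sorted top to bottom, then left to right.
    public static string ByLocation(IEnumerable<Chunk> chunks)
    {
        return string.Join(" ", chunks
            .OrderByDescending(c => c.Y)
            .ThenBy(c => c.X)
            .Select(c => c.Text));
    }
}
```

With the sample data, `OrderingDemo.InDrawOrder(OrderingDemo.Sample())` puts both values after both labels, while `OrderingDemo.ByLocation(OrderingDemo.Sample())` places each value next to its label.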
    
mkl