
Hi, I have a PDF with the following content:

Property Address: 123 Door         Form Type: Miscellaneous
                  ABC City
                  Pin - XXX

When I use iTextSharp to extract the content, I get the following:

Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX

The address is interleaved with the form type because the address continues on the next lines. Please suggest a way to get the content in the following order instead. Thanks

Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous
  • Try reading the PDF in columns. [Please see this post](http://stackoverflow.com/questions/25498598/read-columns-of-pdf-in-c-sharp-using-itextsharp) – AKN Feb 22 '17 at 06:22
  • How do you, when looking at the PDF, recognise that those words belong together in the order you want? As soon as you can describe that in sufficient detail, try to implement that in a program. – mkl Feb 22 '17 at 06:39
  • I got the solution from one of the posts. It's working for me. Please review it. Thanks – Ankur Rai Feb 22 '17 at 07:58
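
To make the suggestion in the comments concrete, here is a minimal sketch of one possible rule: text left of a fixed x position belongs to the address column and is read top to bottom before anything to the right of it. `TextChunk`, `ColumnReader`, and the boundary value are hypothetical names for illustration, not iTextSharp types; in a real program the coordinates would have to come from whatever positional data the extraction library exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text (not an iTextSharp type).
public sealed class TextChunk
{
    public string Text { get; }
    public float X { get; }   // left edge of the chunk
    public float Y { get; }   // baseline; larger values are higher on the page, as in PDF

    public TextChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ColumnReader
{
    // Reads all chunks left of columnBoundaryX top to bottom,
    // then all chunks at or right of it, and joins them with spaces.
    public static string ReadByColumns(IEnumerable<TextChunk> chunks, float columnBoundaryX)
    {
        var ordered = chunks
            .OrderBy(c => c.X < columnBoundaryX ? 0 : 1) // left column first
            .ThenByDescending(c => c.Y)                  // top to bottom
            .ThenBy(c => c.X)                            // left to right within a line
            .Select(c => c.Text);

        return string.Join(" ", ordered);
    }
}
```

With assumed coordinates that mimic the layout in the question (address chunks left of x = 250, the form type to the right), `ColumnReader.ReadByColumns(chunks, 250f)` yields the address lines first and the form type last. A fixed boundary only works while the layout stays the same, so treat the value as something to tune per document type.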

2 Answers

0

The following iTextSharp code extracted the text in the order I needed:

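// Read every page and append its text to the output file.
PdfReader reader = new PdfReader(path);
int pageCount = reader.NumberOfPages;
for (int page = 1; page <= pageCount; page++)
{
    // SimpleTextExtractionStrategy returns text in the order it is drawn on the page.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string tt = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
    // tt is a single string, so use AppendAllText; AppendAllLines expects a sequence of lines.
    File.AppendAllText(outfile, tt + Environment.NewLine, Encoding.UTF8);
}
reader.Close();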
  • The `SimpleTextExtractionStrategy` returns text in the order it is drawn. In your case that seems to be the order you need. But the solution producing your inputs might change over time, and if the orders then don't coincide anymore, you have to find a different way. – mkl Feb 22 '17 at 08:30
  • That being said, `tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)))` is a complicated NOP (No-Operation), isn't it? – mkl Feb 22 '17 at 08:31
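
To check the NOP remark from the last comment, the conversion chain can be isolated in a small helper and compared against its input. This is only a sketch; `EncodingRoundTrip` and `RoundTrip` are hypothetical names. One assumption to keep in mind: `Encoding.Default` differs by runtime (the system ANSI code page on .NET Framework, UTF-8 on .NET Core and later), so characters that the default encoding cannot represent may come back replaced rather than unchanged.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Same steps as the one-liner in the answer, split up for inspection:
    // string -> bytes in the default encoding -> bytes in UTF-8 -> string.
    public static string RoundTrip(string text)
    {
        byte[] defaultBytes = Encoding.Default.GetBytes(text);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Default, Encoding.UTF8, defaultBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    // True when the round trip leaves the text unchanged.
    public static bool IsNoOpFor(string text)
    {
        return RoundTrip(text) == text;
    }
}
```

For plain ASCII text such as `"Property Address: 123 Door"`, `EncodingRoundTrip.IsNoOpFor` returns true, which suggests the line can be dropped for input like the sample in the question.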
0

I'm using the helper class below to convert a PDF to text. It works well for me. If anyone needs a full working desktop application, please refer to this GitHub repo: https://github.com/Kithuldeniya/PDFReader

using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;

namespace PDFReader.Helpers
{
    public static class PdfHelper
    {
        public static string ManipulatePdf(string filePath)
        {
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));

            // No filter is attached here, so the listener passes every text event through.
            FilteredEventListener listener = new FilteredEventListener();

            // LocationTextExtractionStrategy orders the text by its position on the page
            LocationTextExtractionStrategy extractionStrategy = listener
                .AttachEventListener(new LocationTextExtractionStrategy());

            // Note: only the first page is processed. To re-use a PdfCanvasProcessor, call Reset() on it first.
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());

            // Get the extracted text
            String actualText = extractionStrategy.GetResultantText();

            pdfDoc.Close();

            return actualText;

        }
    }
}