C# Pdf to Text with values in multiple line

Question

Hi, I have a PDF with the following content:

Property Address: 123 Door         Form Type: Miscellaneous
                  ABC City
                  Pin - XXX

When I use iTextSharp to get the content, it comes out as follows:

Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX

The values are mixed up because the address continues on the following lines. Please suggest a way to get the content in the required order, shown below. Thanks

Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous

Try reading the PDF in columns. [Please see this post](http://stackoverflow.com/questions/25498598/read-columns-of-pdf-in-c-sharp-using-itextsharp) – AKN, Feb 22 '17 at 06:22
How do you, when looking at the PDF, recognise that those words belong together in the order you want? As soon as you can describe that in sufficient detail, try to implement that in a program. – mkl, Feb 22 '17 at 06:39
I got the solution from one of the posts. It's working for me. Please review it. Thanks – Ankur Rai, Feb 22 '17 at 07:58
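
The column idea raised in the comments above can be sketched without any PDF library. This is a minimal sketch, assuming the text chunks and their positions have already been extracted by some other means; the `Chunk` class, the `ColumnReader` class, and the `columnBoundary` parameter are hypothetical names used for illustration and are not part of iTextSharp. Coordinates are assumed to follow the PDF convention in which y grows upward.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one extracted piece of text and its position.
// X is the left edge and Y the baseline; Y is assumed to grow upward.
public sealed class Chunk
{
    public string Text { get; }
    public double X { get; }
    public double Y { get; }

    public Chunk(string text, double x, double y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ColumnReader
{
    // Emits every chunk left of columnBoundary from top to bottom, then every
    // chunk at or right of it, so a multi-line value stays in one piece.
    public static string ReadByColumn(IEnumerable<Chunk> chunks, double columnBoundary)
    {
        IEnumerable<string> ordered = chunks
            .OrderBy(c => c.X >= columnBoundary)  // false (left column) sorts first
            .ThenByDescending(c => c.Y)           // higher on the page first
            .ThenBy(c => c.X)                     // left to right within one line
            .Select(c => c.Text);
        return string.Join(" ", ordered);
    }
}
```

With the four pieces from the question placed at made-up positions (the address lines left of x = 300, the form type right of it), `ColumnReader.ReadByColumn(chunks, 300)` returns the address lines first and the form type last. A fixed boundary only suits documents whose layout is known in advance.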

score 0 · Accepted Answer · edited Feb 22 '17 at 08:26

The following code, using iTextSharp, extracted the text in the required order:

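using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(path);
int pagenumber = reader.NumberOfPages;
for (int page = 1; page <= pagenumber; page++)
{
    // SimpleTextExtractionStrategy returns text in the order it was drawn
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string tt = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
    // AppendAllLines expects a sequence of strings, so wrap the page text in an array
    File.AppendAllLines(outfile, new[] { tt }, Encoding.UTF8);
}
reader.Close();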

answered Feb 22 '17 at 07:57 by Ankur Rai · edited Feb 22 '17 at 08:26 by mkl

The `SimpleTextExtractionStrategy` returns text in the order it is drawn. In your case that seems to be the order you need. But the solution producing your inputs might change over time, and if the orders then don't coincide anymore, you have to find a different way. – mkl Feb 22 '17 at 08:30
That being said, `tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)))` is a complicated NOP (no-operation), isn't it? – mkl Feb 22 '17 at 08:31
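
The point about the conversion line can be checked in isolation. The following is a minimal sketch using only `System.Text`; `EncodingCheck` and `RoundTrip` are hypothetical names, and the encoding is passed in explicitly as a stand-in for `Encoding.Default`, whose value depends on the runtime and the machine.

```csharp
using System;
using System.Text;

public static class EncodingCheck
{
    // Same chain as in the answer, with the encoding passed in instead of
    // relying on the machine-dependent Encoding.Default.
    public static string RoundTrip(string text, Encoding enc)
    {
        byte[] original = enc.GetBytes(text);                          // string -> bytes in enc
        byte[] utf8 = Encoding.Convert(enc, Encoding.UTF8, original);  // bytes in enc -> UTF-8 bytes
        return Encoding.UTF8.GetString(utf8);                          // UTF-8 bytes -> string
    }
}
```

For text that the chosen encoding can represent, `RoundTrip` returns its input unchanged, which is why the line does nothing useful. With `Encoding.ASCII`, characters it cannot represent come back as `?`, so in the worst case the line only loses information.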

score 0 · Answer 2 · answered Oct 05 '20 at 07:24

I'm using the helper class below to convert a PDF to text, and it is working well for me. If anyone needs a full working desktop application, please refer to this GitHub repo: https://github.com/Kithuldeniya/PDFReader

using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;

namespace PDFReader.Helpers
{
    public static class PdfHelper
    {
        public static string ManipulatePdf(string filePath)
        {
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));

            //CustomFontFilter fontFilter = new CustomFontFilter(rect);
            // No filter is attached below, so every text event reaches the strategy
            FilteredEventListener listener = new FilteredEventListener();

            // Attach a location-based text extraction strategy to the listener
            LocationTextExtractionStrategy extractionStrategy = listener
                .AttachEventListener(new LocationTextExtractionStrategy());

            // Only the first page is processed here.
            // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.Reset()
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());

            // Get the text collected by the extraction strategy
            String actualText = extractionStrategy.GetResultantText();

            pdfDoc.Close();

            return actualText;

        }
    }
}
