Paragraph Reading in PDF

Question

In my code, I need to read the content of a PDF file and, based on some specific requirements, insert that content into a SQL Server database. I use iTextSharp to read the PDF. It reads well when it finds an entire line in the PDF. The problems start when it encounters a table inside the PDF.

It first goes to column 1 and reads a line, then jumps to column 2 and reads a line there, and so on. The problem is that column 1 and column 2 each contain a paragraph, so the extraction breaks those paragraphs into separate single lines that have no meaning on their own.

I want it to go to column 1 and read the whole paragraph; if it finds a new paragraph after a newline, it should read that paragraph as well. Only after processing all of column 1 should it jump to column 2.

Currently I am using the code below:

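using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"D:\pdf1.pdf");
int PageNum = reader.NumberOfPages;

StringBuilder text = new StringBuilder();

// iTextSharp page numbers are 1-based.
for (int i = 1; i <= PageNum; i++)
{
    // A fresh strategy per page, so text from earlier pages is not carried over.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);

    // Re-encode the extracted string from the default encoding to UTF-8.
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
                                          Encoding.UTF8,
                                          Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);

    // ReadContent (not shown) processes the text of the current page.
    ReadContent(text.ToString());
    text.Clear();
}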

Possible duplicate of [Reading PDF documents in .Net](https://stackoverflow.com/questions/83152/reading-pdf-documents-in-net) — Gaurav Mall, Jul 01 '19 at 12:59
First of all, what do you do those `Encoding` gymnastics for? Furthermore, you use the `SimpleTextExtractionStrategy`. This strategy returns the content in the order it is drawn. It is seldom that a pdf generator draws content in the order you describe. Chances are in that case that copy&paste from Adobe Reader returns the text similarly. — mkl, Jul 01 '19 at 17:45
Hello, do you know of any other way that I can read the text of each table cell separately? — Udai Mathur, Jul 02 '19 at 07:06
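
One possible direction, offered as a rough sketch rather than a confirmed solution: iTextSharp 5 can restrict text extraction to a rectangular region of the page, so each column could be extracted on its own and the results concatenated. The sketch below assumes the columns are of equal width and span the full page height, and that the number of columns is known in advance; `ColumnTextReader`, `ReadColumns` and `columnCount` are hypothetical names, not part of iTextSharp. For a real table, the region rectangles would have to match the actual cell boundaries, which this code does not detect.

```csharp
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ColumnTextReader
{
    // Extracts the text of every page one column at a time, left to right.
    // Assumption: the page is split into columnCount equal-width columns.
    public static string ReadColumns(string path, int columnCount)
    {
        var result = new StringBuilder();
        var reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                Rectangle size = reader.GetPageSize(page);
                float columnWidth = size.Width / columnCount;

                for (int c = 0; c < columnCount; c++)
                {
                    // Region covering column c over the full page height.
                    var region = new Rectangle(
                        size.Left + c * columnWidth, size.Bottom,
                        size.Left + (c + 1) * columnWidth, size.Top);

                    // Only text inside the region reaches the wrapped strategy;
                    // LocationTextExtractionStrategy then orders it by position.
                    RenderFilter[] filters = { new RegionTextRenderFilter(region) };
                    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                        new LocationTextExtractionStrategy(), filters);

                    result.AppendLine(
                        PdfTextExtractor.GetTextFromPage(reader, page, strategy));
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return result.ToString();
    }
}
```

A call such as `ColumnTextReader.ReadColumns(@"D:\pdf1.pdf", 2)` would then return all of column 1 followed by all of column 2 for each page. Text chunks that straddle a column boundary may be dropped or may appear in both columns, so the region coordinates usually need tuning for the specific document.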
