In my code, I need to read the content of a PDF file and, based on some specific requirements, insert that content into a SQL Server database. I am using iTextSharp to read the PDF. It works well when a line of text runs across the whole page. The problems start when the PDF contains a table.
It reads the first line of column 1, then jumps to column 2 and reads a line there, and so on. The problem is that column 1 and column 2 each contain a paragraph, so the extraction breaks those paragraphs into separate single lines that have no meaning on their own.
I want it to go to column 1 and read the whole paragraph; if a new paragraph follows after a newline, it should read that paragraph too, starting from the next line. Only after it has processed all of column 1 should it jump to column 2.
Currently I am using the code below:
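using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"D:\pdf1.pdf");
int PageNum = reader.NumberOfPages;
StringBuilder text = new StringBuilder();
for (int i = 1; i <= PageNum; i++)
{
    // SimpleTextExtractionStrategy returns text in the order it appears in the content stream.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    // .NET strings are already Unicode; a round trip through Encoding.Default
    // would only replace characters outside the ANSI code page with '?'.
    text.Append(currentText);
    // ReadContent is my own method that processes the text of one page.
    ReadContent(text.ToString());
    text.Clear();
}
reader.Close();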
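
To make the reading order I am after concrete, here is a minimal sketch that extracts each column separately using iTextSharp's `RegionTextRenderFilter`. It assumes a page with exactly two equal-width columns split at the horizontal middle of the page, which is only an assumption for illustration; real column boundaries would have to be measured from the actual PDF. The class name `ColumnExtractor` and the helpers `ExtractTwoColumns` and `ExtractRegion` are hypothetical names, not part of iTextSharp.

```csharp
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ColumnExtractor
{
    // Hypothetical helper: returns the text of one page, left column first,
    // then right column. Assumes two columns split at the middle of the page.
    public static string ExtractTwoColumns(PdfReader reader, int pageNumber)
    {
        Rectangle page = reader.GetPageSize(pageNumber);
        float middle = page.Left + page.Width / 2f; // assumed column boundary

        Rectangle leftColumn = new Rectangle(page.Left, page.Bottom, middle, page.Top);
        Rectangle rightColumn = new Rectangle(middle, page.Bottom, page.Right, page.Top);

        StringBuilder text = new StringBuilder();
        text.AppendLine(ExtractRegion(reader, pageNumber, leftColumn));
        text.AppendLine(ExtractRegion(reader, pageNumber, rightColumn));
        return text.ToString();
    }

    // Extracts only the text that falls inside the given rectangle.
    private static string ExtractRegion(PdfReader reader, int pageNumber, Rectangle region)
    {
        RenderFilter[] filters = { new RegionTextRenderFilter(region) };
        // LocationTextExtractionStrategy orders the filtered text top to bottom
        // by position instead of by content-stream order.
        ITextExtractionStrategy strategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);
        return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
    }
}
```

In the loop above, the call to `PdfTextExtractor.GetTextFromPage(reader, i, strategy)` would become `ColumnExtractor.ExtractTwoColumns(reader, i)`. Text that straddles the assumed boundary may show up in both columns, and pages where the table covers only part of the page would need narrower rectangles.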