Extract text by line from PDF using iTextSharp C#

Question

I need to run some analysis by extracting data from a PDF document.

Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned the text as a single long line.

Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data line by line would be more flexible.

Below is the code I used:

        string urlFileName1 = "pdf_link";
        PdfReader reader = new PdfReader(urlFileName1);
        string text = string.Empty;
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, page);
        }
        reader.Close();
        candidate3.Text = text.ToString();

@Xander a few questions. First, does `PdfReader(urlFileName1)` read all of the lines at once during that call? If so, you probably need to change that for loop to a while loop and call a `reader.ReadLine()` method. I am looking at how one would normally read using the StreamReader class; let me know if there is a `.ReadLine()` method. Check this link: [Reading PDF Content](http://stackoverflow.com/questions/2550796/reading-pdf-content-with-itextsharp-dll-in-vb-net-or-c-sharp) — MethodMan, Apr 01 '13 at 18:07
Hi @DJKRAZE. Yes, PdfReader(urlFileName1) reads all the lines at once. I don't think there is a .ReadLine() method in iTextSharp. I went through their [API](http://api.itextpdf.com/itext/) and couldn't find it. Could you post a sample to show what you mean with the while loop? — Xander, Apr 01 '13 at 18:25
Look at this previous Stack Overflow post; it should point you in the right direction: http://stackoverflow.com/questions/2550796/reading-pdf-content-with-itextsharp-dll-in-vb-net-or-c-sharp — MethodMan, Apr 01 '13 at 18:29
`PdfTextExtractor.GetTextFromPage(reader, page)` uses the `LocationTextExtractionStrategy` which in turn does insert `'\n'` whenever the text line changes. If it does not for you, something is fishy. Could you, therefore, supply the PDF for inspection? — mkl, Apr 01 '13 at 23:00
Hi @mkl, I'm not sure whether it inserts the '\n', because when I print the text in my browser it shows one long string. Could the way I append the text be wrong? If so, how should I append it so that every line is split at '\n' and stored in an array instead of a single string? This is my [PDF](https://www.dropbox.com/s/66q8i456vgliutu/Sample-profile.pdf) for inspection — Xander, Apr 02 '13 at 03:38
Hi @VahidN, that works great. How can I store each line in an array? — Xander, Apr 02 '13 at 05:53
Either you split the string at the newline characters or you create your own RenderListener which directly creates string arrays. — mkl, Apr 02 '13 at 08:30

score 14 · Answer 1 · answered Jan 02 '15 at 13:13

    // Requires: using System.Windows.Forms; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public void ExtractTextFromPdf(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Use a fresh strategy for each page; a reused instance keeps the text of earlier pages
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

                // LocationTextExtractionStrategy separates the lines of a page with '\n'
                string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
                string[] lines = page.Split('\n');
                foreach (string line in lines)
                {
                    MessageBox.Show(line);
                }
            }
        }
    }

When posting answers, always include some summary about how your code works and what it exactly does. Simply posting a code snippet is usually not enough. — Robert Rossmann, Jan 02 '15 at 13:18

score 4 · Answer 2 · answered Aug 16 '18 at 13:25

I know this is an older post, but I spent a lot of time figuring this out, so I'm sharing it for anyone who finds this through a search:

using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
class Program
{
    static void Main(string[] args)
    {

        string filePath = @"Your said path\the file name.pdf";
        string outPath = @"the output said path\the text file name.txt";
        int pagesToScan = 2;

        string strText = string.Empty;
        try
        {
            PdfReader reader = new PdfReader(filePath);

            // Open the output file once (append mode) instead of reopening it for every line
            using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
            {
                // Use reader.NumberOfPages as the upper bound to scan every page of the PDF
                for (int page = 1; page <= pagesToScan; page++)
                {
                    // A fresh strategy per page keeps text from earlier pages out of the result
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();

                    // GetTextFromPage already returns a .NET string, so no encoding conversion is needed
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                    // Split the page text into a string array, one element per line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        file.WriteLine(line);
                    }
                }
            }

            reader.Close();
        }
        catch (Exception ex)
        {
            Console.Write(ex);
        }
    }
}
}

I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.

score 3 · Answer 3 · answered May 05 '20 at 06:02


None of the other code samples here worked for me, probably because the API changed in iText 7.

This minimal example works; it extracts the text of the first page:

var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());

answered May 05 '20 at 06:02 by dodgy_coder

GetTextFromPage has an overload that allows you to pass the ITextExtractionStrategy as well. – Jan Van der Haegen May 21 '20 at 14:17

score 1 · Answer 4 · answered Jul 06 '17 at 20:44

LocationTextExtractionStrategy automatically inserts '\n' into the output text. However, it sometimes inserts '\n' where it shouldn't. In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:

public virtual bool SameLine(ITextChunkLocation other) {
    return OrientationMagnitude == other.OrientationMagnitude &&
           DistPerpendicular == other.DistPerpendicular;
}

In some cases '\n' shouldn't be inserted when DistPerpendicular and other.DistPerpendicular differ only slightly, so you need to change the second condition to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.

Alternatively, you can put that comparison in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
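To illustrate the tolerance idea in isolation, here is a minimal sketch that does not depend on iTextSharp. `ChunkPosition` and `LineGrouper` are hypothetical names invented for this example, not library types; they only mimic the two properties used by `SameLine`, and the tolerance of 10 is just the value suggested above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a text chunk location; this is NOT an iTextSharp type.
public sealed class ChunkPosition
{
    public int OrientationMagnitude { get; private set; }
    public int DistPerpendicular { get; private set; }
    public string Text { get; private set; }

    public ChunkPosition(int orientationMagnitude, int distPerpendicular, string text)
    {
        OrientationMagnitude = orientationMagnitude;
        DistPerpendicular = distPerpendicular;
        Text = text;
    }

    // Same orientation, and perpendicular distances that differ by less than the tolerance.
    public bool SameLine(ChunkPosition other, int tolerance)
    {
        return OrientationMagnitude == other.OrientationMagnitude &&
               Math.Abs(DistPerpendicular - other.DistPerpendicular) < tolerance;
    }
}

public static class LineGrouper
{
    // Concatenates consecutive chunks into lines; a new line starts whenever
    // a chunk is not on the same line as the chunk immediately before it.
    public static List<string> GroupIntoLines(IEnumerable<ChunkPosition> chunks, int tolerance)
    {
        var lines = new List<string>();
        var current = new StringBuilder();
        ChunkPosition previous = null;

        foreach (ChunkPosition chunk in chunks)
        {
            if (previous != null && !chunk.SameLine(previous, tolerance))
            {
                lines.Add(current.ToString());
                current.Clear();
            }
            current.Append(chunk.Text);
            previous = chunk;
        }

        // Flush the last line if at least one chunk was seen.
        if (previous != null)
        {
            lines.Add(current.ToString());
        }
        return lines;
    }
}
```

With a tolerance of 10, chunks at perpendicular distances 100 and 103 end up on one line, while a chunk at 80 starts a new one. Each chunk is compared with its direct predecessor, so a slowly drifting baseline is still treated as one line; comparing against the first chunk of the line would be the stricter alternative.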

score 0 · Answer 5 · answered Mar 26 '14 at 10:00

Use LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. The text extracted by LocationTextExtractionStrategy contains a newline character at the end of each line.

// renderFilter is a RenderFilter defined elsewhere (for example a RegionTextRenderFilter)
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);
// Each array element holds one line of the page
string[] lines = pdftext.Split('\n');
return lines;
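If the extracted text may contain blank lines or Windows-style line endings, the split step can be made a little more forgiving. The following is a minimal sketch using only the .NET base library; `LineSplitter` is a hypothetical helper name, and dropping empty lines is an assumption that may not suit every analysis.

```csharp
using System;

public static class LineSplitter
{
    // Splits text into lines, accepting "\r\n", "\n" or "\r" as separators.
    // Empty entries are removed (an assumption of this sketch).
    public static string[] SplitLines(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return new string[0];
        }

        return text.Split(
            new[] { "\r\n", "\n", "\r" },
            StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The text returned by GetTextFromPage could then be passed to `LineSplitter.SplitLines` instead of calling `Split('\n')` directly.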

score -2 · Answer 6 · edited Aug 13 '13 at 08:10


Try splitting the page text on the newline character:

 string page = PdfTextExtractor.GetTextFromPage(reader, 2);
 string[] s1 = page.Split('\n');

answered May 09 '13 at 12:52 by adebayo · edited Aug 13 '13 at 08:10 by ridoy
