Reading PDF documents in .Net

Question

Is there an open source library that will help me with reading/parsing PDF documents in .NET/C#?

More updated iTextSharp answers [here](https://stackoverflow.com/questions/2550796/reading-pdf-content-with-itextsharp-dll-in-vb-net-or-c-sharp) since this question is closed. — VDWWD, Jan 09 '20 at 23:32

score 132 · Accepted Answer · edited Sep 18 '12 at 23:33


Since this question was last answered in 2008, iTextSharp has improved its API dramatically. If you download the latest version from http://sourceforge.net/projects/itextsharp/, you can use the following snippet to extract all the text from a PDF into a string.

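using System.Text;
using iTextSharp.text.pdf;

namespace PdfParser
{
    public static class PdfTextExtractor
    {
        // Returns the text of every page in the PDF at the given path.
        public static string pdfText(string path)
        {
            PdfReader reader = new PdfReader(path);
            StringBuilder text = new StringBuilder();
            try
            {
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // Fully qualified, because this class shares its name with iTextSharp's extractor.
                    text.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page));
                }
            }
            finally
            {
                // Release the file even if extraction throws.
                reader.Close();
            }
            return text.ToString();
        }
    }
}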

edited Sep 18 '12 at 23:33 by ptilton · answered Nov 02 '11 at 18:30 by Brock Nusser

You probably shouldn't call your class `PdfTextExtractor` as it will clash with the one in `iTextSharp.text.pdf.parser` – Neil Jun 12 '12 at 15:25

iTextSharp has moved to GitHub: http://github.com/itext/itextsharp – Amedee Van Gasse Dec 09 '15 at 15:38

It is now paid for commercial projects. – Nikolay Kostov Jun 23 '16 at 11:11
Licenced AGPL so that it can be used to create commercial software only if it is also AGPL licenced. If you want to develop commercial, proprietary software you must pay. – Sylwester Santorowski Jul 19 '19 at 08:45

@iTextSharp has been deprecated and replaced with iText 7 https://github.com/itext/itext7-dotnet. – Matthew Feb 27 '20 at 21:13
** BEWARE ** If you're writing for a commercial company, this is off the table (cost prohibitive by a factor of 10 vs. alternative products). AGPL3 is, in all practicality NOT open source unless your project and ALL its consumers is too. It is a tool for companies who use it to appear open source at the start, and make a huge profit directly off the software anyway. Not that people don't deserve to be paid for working, of course they do if they want to! But this is bait and switch. so BEWARE – FastAl Nov 23 '22 at 19:24
@FastAl I'm not sure if this problem can be solved using iText 4, but AGPL license was introduced with version 5. Previous versions were available under LGPL and MPL. So a version of iText is available for free (under a convenient license), although it might be somewhat outdated. – jahu Feb 03 '23 at 10:16

score 64 · Answer 2 · edited Jun 03 '19 at 08:49

iTextSharp is the best bet. I used it to build a spider for Lucene.Net so that it could crawl PDF files.

using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;

namespace Spider.Utils
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public class PDFParser
    {
        /// BT = Beginning of a text object operator 
        /// ET = End of a text object operator
        /// Td move to the start of next line
        ///  5 Ts = superscript
        /// -5 Ts = subscript

        #region Fields

        #region _numberOfCharsToKeep
        /// <summary>
        /// The number of characters to keep, when extracting text.
        /// </summary>
        private static int _numberOfCharsToKeep = 15;
        #endregion

        #endregion

        #region ExtractText
        /// <summary>
        /// Extracts a text from a PDF file.
        /// </summary>
        /// <param name="inFileName">the full path to the pdf file.</param>
        /// <param name="outFileName">the output file name.</param>
        /// <returns>true if the text was extracted and written to the output file; false on any error</returns>
        public bool ExtractText(string inFileName, string outFileName)
        {
            StreamWriter outFile = null;
            try
            {
                // Create a reader for the given PDF file
                PdfReader reader = new PdfReader(inFileName);
                //outFile = File.CreateText(outFileName);
                outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                Console.Write("Processing: ");

                int totalLen = 68;
                float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
                int totalWritten = 0;
                float curUnit = 0;

                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                    // Write the progress.
                    if (charUnit >= 1.0f)
                    {
                        for (int i = 0; i < (int)charUnit; i++)
                        {
                            Console.Write("#");
                            totalWritten++;
                        }
                    }
                    else
                    {
                        curUnit += charUnit;
                        if (curUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)curUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                            curUnit = 0;
                        }

                    }
                }

                if (totalWritten < totalLen)
                {
                    for (int i = 0; i < (totalLen - totalWritten); i++)
                    {
                        Console.Write("#");
                    }
                }
                return true;
            }
            catch
            {
                return false;
            }
            finally
            {
                if (outFile != null) outFile.Close();
            }
        }
        #endregion

        #region ExtractTextFromPDFBytes
        /// <summary>
        /// This method processes an uncompressed Adobe (text) object 
        /// and extracts text.
        /// </summary>
        /// <param name="input">uncompressed</param>
        /// <returns></returns>
        public string ExtractTextFromPDFBytes(byte[] input)
        {
            if (input == null || input.Length == 0) return "";

            try
            {
                string resultString = "";

                // Flag showing whether we are currently inside a text object
                bool inTextObject = false;

                // Flag showing if the next character is literal 
                // e.g. '\\' to get a '\' character or '\(' to get '('
                bool nextLiteral = false;

                // () Bracket nesting level. Text appears inside ()
                int bracketDepth = 0;

                // Keep the previous characters to extract numbers, operators, etc.:
                char[] previousCharacters = new char[_numberOfCharsToKeep];
                for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';


                for (int i = 0; i < input.Length; i++)
                {
                    char c = (char)input[i];
                    if (input[i] == 213)
                        c = "'".ToCharArray()[0];

                    if (inTextObject)
                    {
                        // Position the text
                        if (bracketDepth == 0)
                        {
                            if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            {
                                resultString += "\r\n";
                            }
                            else
                            {
                                if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                                {
                                    resultString += "\n";
                                }
                                else
                                {
                                    if (CheckToken(new string[] { "Tj" }, previousCharacters))
                                    {
                                        resultString += " ";
                                    }
                                }
                            }
                        }

                        // End of a text object, also go to a new line.
                        if (bracketDepth == 0 &&
                            CheckToken(new string[] { "ET" }, previousCharacters))
                        {

                            inTextObject = false;
                            resultString += " ";
                        }
                        else
                        {
                            // Start outputting text
                            if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                            {
                                bracketDepth = 1;
                            }
                            else
                            {
                                // Stop outputting text
                                if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                                {
                                    bracketDepth = 0;
                                }
                                else
                                {
                                    // Just a normal text character:
                                    if (bracketDepth == 1)
                                    {
                                        // Only print out next character no matter what. 
                                        // Do not interpret.
                                        if (c == '\\' && !nextLiteral)
                                        {
                                            resultString += c.ToString();
                                            nextLiteral = true;
                                        }
                                        else
                                        {
                                            if (((c >= ' ') && (c <= '~')) ||
                                                ((c >= 128) && (c < 255)))
                                            {
                                                resultString += c.ToString();
                                            }

                                            nextLiteral = false;
                                        }
                                    }
                                }
                            }
                        }
                    }

                    // Store the recent characters for 
                    // when we have to go back for a checking
                    for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    {
                        previousCharacters[j] = previousCharacters[j + 1];
                    }
                    previousCharacters[_numberOfCharsToKeep - 1] = c;

                    // Start of a text object
                    if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    {
                        inTextObject = true;
                    }
                }

                return CleanupContent(resultString);
            }
            catch
            {
                return "";
            }
        }

        private string CleanupContent(string text)
        {
            string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221"};
            string[] replace = {   "(",     ")",      "-",     "'",      "\"",      "\"",    "à",      "â",      "ä",      "À",      "Â",      "Ä",      "é",      "è",      "ê",      "ë",      "É",      "È",      "Ê",      "Ë",      "ò",      "ô",      "ö",      "Ò",      "Ô",      "Ö",      "ì",      "î",      "ï",      "Ì",      "Î",      "Ï",      "ç",      "Ç",      "ù",      "û",      "ü",      "Ù",      "Û",      "Ü",      "®",      "™",      "«",      "»",      "©",      "'" };

            for (int i = 0; i < patterns.Length; i++)
            {
                string regExPattern = patterns[i];
                Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
                text = regex.Replace(text, replace[i]);
            }

            return text;
        }

        #endregion

        #region CheckToken
        /// <summary>
        /// Checks whether one of the given operator tokens (e.g. BT, Tj or ')
        /// has just been read, delimited by whitespace on both sides.
        /// </summary>
        /// <param name="tokens">the tokens to search for (one or more characters each)</param>
        /// <param name="recent">the recent character array</param>
        /// <returns>true if any token matches</returns>
        private bool CheckToken(string[] tokens, char[] recent)
        {
            int last = _numberOfCharsToKeep - 1;

            // The token must be followed by whitespace.
            if (!IsWhitespace(recent[last])) return false;

            foreach (string token in tokens)
            {
                // Index of the token's first character; the slot before it must be whitespace too.
                int start = last - token.Length;
                if (token.Length == 0 || start < 1) continue;

                // Comparing by token length avoids indexing past one-character tokens such as '.
                bool match = IsWhitespace(recent[start - 1]);
                for (int k = 0; match && k < token.Length; k++)
                {
                    match = recent[start + k] == token[k];
                }
                if (match) return true;
            }
            return false;
        }

        private static bool IsWhitespace(char c)
        {
            return c == ' ' || c == 0x0d || c == 0x0a;
        }
        #endregion
    }
}

Hello ceetheman, I tried to use the code you provided above, but I have a problem: some of my PDF files are read properly, but for others I get an "Index Out of Range" error in the function "CheckToken". Can you please help me resolve this? — Radhi, Feb 22 '10 at 12:38
Referencing the source of your example is a good & polite idea. In this case the same source code can be found here http://www.codeproject.com/KB/cs/PDFToText.aspx — Myster, Apr 29 '10 at 00:22
I have problems with this code: it returns gobbledygook made up of the letters r and n. I used PDFBox in the end. — Myster, Apr 29 '10 at 03:39
So weird... I plugged in my pdf and I got 1627 empty lines in my text file... — Ortund, Oct 05 '17 at 11:18
The answer provided by Brock Nusser looks like the most up-to-date solution and should be considered as being the right answer for this question. — ceetheman, Jan 11 '18 at 15:38
Hello! I really like this solution, and it works smoothly on English PDF documents. However, with French PDF documents I get a lot of "\036" and "\037" in the text file. I noticed that you use CleanupContent to clean the document, but by the same logic "\037" can be an "F" or a " ", so I'm a little lost. Can you please explain the usage of CleanupContent in more detail? Thank you. — Oussama melki, Sep 07 '19 at 20:34
** BEWARE ** If you're writing for a commercial company, this is off the table (cost prohibitive by a factor of 10 vs. alternative products). AGPL3 is, in all practicality NOT open source unless your project and ALL its consumers is too. It is a tool for companies who use it to appear open source at the start, and make a huge profit directly off the software anyway. Not that people don't deserve to be paid for working, of course they do if they want to! But this is bait and switch. so BEWARE — FastAl, Nov 23 '22 at 19:24
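On the "\036" and "\037" sequences raised in the comments: inside a PDF literal string, a backslash followed by up to three octal digits encodes a single byte, and CleanupContent only maps a fixed list of such codes. Below is a minimal sketch of a general decoder, not part of the original answer. The names `PdfLiteralString` and `Unescape` are hypothetical, and the result is still a sequence of raw byte values: which character a byte stands for depends on the font's encoding, which may be why "\037" appears to mean different things in different places.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the original answer): decodes the escape
// sequences allowed inside a PDF literal string, i.e. the text between ( and ).
public static class PdfLiteralString
{
    public static string Unescape(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            // Ordinary characters (and a lone trailing backslash) are copied as-is.
            if (c != '\\' || i == raw.Length - 1)
            {
                sb.Append(c);
                continue;
            }

            char next = raw[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '(':
                case ')':
                case '\\': sb.Append(next); break;
                case '\r':
                    // Backslash + end of line is a line continuation: emit nothing.
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                case '\n':
                    break;
                default:
                    if (next >= '0' && next <= '7')
                    {
                        // One to three octal digits form one byte value.
                        int value = next - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < raw.Length &&
                               raw[i + 1] >= '0' && raw[i + 1] <= '7')
                        {
                            value = value * 8 + (raw[++i] - '0');
                            digits++;
                        }
                        // Keep only the low byte; overflow is ignored.
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        // Unknown escape: drop the backslash, keep the character.
                        sb.Append(next);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

For example, `PdfLiteralString.Unescape(@"caf\351")` produces "caf" followed by the byte 0xE9, which reads as "é" only if the font uses a Latin-1-compatible encoding. For fonts with custom encodings, the byte values still have to be mapped through the font's encoding or ToUnicode table, which is the work a library such as iTextSharp does for you.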

score 7 · Answer 3 · edited Apr 21 '20 at 20:20


PDFClown might help, but I would not recommend it for large or heavy-use applications.

edited Apr 21 '20 at 20:20 by John Smith · answered Sep 17 '08 at 13:29 by Ilya Kochetov

Licenced LGPL so that it can be used to create commercial, proprietary software. – Sylwester Santorowski Jul 19 '19 at 08:43

score 6 · Answer 4 · edited Mar 29 '11 at 09:38


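// Requires: using System.Data; using System.Text;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public string ReadPdfFile(object Filename, DataTable ReadLibray)
{
    // Open the document once and reuse the reader for every page.
    // (The ReadLibray parameter is unused, as in the original.)
    PdfReader reader = new PdfReader((string)Filename);
    StringBuilder strText = new StringBuilder();
    try
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // A fresh strategy per page, so text from earlier pages is not repeated.
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);

            // Round-trip through the default code page, as in the original answer.
            s = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            strText.Append(s);
        }
    }
    finally
    {
        reader.Close();
    }
    return strText.ToString();
}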

edited Mar 29 '11 at 09:38 by Rafal Spacjer · answered Feb 15 '11 at 11:23 by ShravankumarKumar

PdfReader? Pls add some information. – DxTx Aug 20 '18 at 03:01

@DT see [iTextSharp](https://github.com/itext/itextsharp) – dontbyteme Oct 11 '18 at 08:25

score 3 · Answer 5 · answered Sep 17 '08 at 13:33


iText is the best library I know. Originally written in Java, there is a .NET port as well.

See http://www.ujihara.jp/iTextdotNET/en/


That is not an official port, and the link is broken anyway. The official .NET port of iText, iTextSharp, can be found on GitHub: http://github.com/itext/itextsharp – Amedee Van Gasse Dec 09 '15 at 15:39

score 2 · Answer 6 · answered Aug 30 '12 at 02:09


itext?

http://www.itextpdf.com/terms-of-use/index.php

Guide

http://www.vogella.com/articles/JavaPDF/article.html

answered Aug 30 '12 at 02:09 by Dobermaxx99

score 1 · Answer 7 · answered Sep 17 '08 at 13:27


You could look into this: http://www.codeproject.com/KB/showcase/pdfrasterizer.aspx It's not completely free, but it looks very nice.

Alex

answered Sep 17 '08 at 13:27 by Alex Fort

Can this help to convert PDF to raw text? Seems that tool converts it into an image. So i need an OCR library then :-) – JRoppert Sep 17 '08 at 13:33

score 1 · Answer 8 · answered Sep 17 '08 at 13:45


http://www.c-sharpcorner.com/UploadFile/psingh/PDFFileGenerator12062005235236PM/PDFFileGenerator.aspx is open source and may be a good starting point for you.

answered Sep 17 '08 at 13:45 by Ben McEvoy

score 1 · Answer 9 · answered Sep 17 '08 at 15:27


Aspose.PDF works pretty well. Then again, you have to pay for it.

answered Sep 17 '08 at 15:27 by Kuvo

score 0 · Answer 10 · edited Aug 07 '20 at 12:32

Have a look at Docotic.Pdf library. It does not require you to make source code of your application open (like iTextSharp with viral AGPL 3 license, for example).

Docotic.Pdf can be used to read PDF files and extract text with or without formatting. Please have a look at the article that shows how to extract text from PDFs.

Disclaimer: I work for Bit Miracle, vendor of the library.

edited Aug 07 '20 at 12:32 · answered Dec 14 '11 at 16:59 by Bobrovsky

Only 30 days free. Not a good option... – José Augustinho Jul 04 '18 at 17:07

score -1 · Answer 11 · answered Sep 17 '08 at 13:31


There is also LibHaru

http://libharu.org/wiki/Main_Page

answered Sep 17 '08 at 13:31 by Cetra

Link broken. http://libharu.org/ – TernaryTopiary May 08 '17 at 06:37

Also: "At this moment libHaru does not support reading and editing existing PDF files and it's unlikely this support will ever appear." Is this actually relevant? – TernaryTopiary May 08 '17 at 06:38
