Is there an open source library that will help me with reading/parsing PDF documents in .NET/C#?
-
More updated iTextSharp answers [here](https://stackoverflow.com/questions/2550796/reading-pdf-content-with-itextsharp-dll-in-vb-net-or-c-sharp) since this question is closed. – VDWWD Jan 09 '20 at 23:32
11 Answers
Since this question was last answered in 2008, iTextSharp has improved its API dramatically. If you download the latest version from http://sourceforge.net/projects/itextsharp/, you can use the following snippet to extract all text from a PDF into a string.
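using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfParser
{
    public static class PdfTextExtractor
    {
        public static string pdfText(string path)
        {
            PdfReader reader = new PdfReader(path);
            StringBuilder text = new StringBuilder();
            try
            {
                // iTextSharp page numbers are 1-based.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // Fully qualified, because this class shares its name with iTextSharp's PdfTextExtractor.
                    text.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page));
                }
            }
            finally
            {
                // Release the file even if extraction throws.
                reader.Close();
            }
            return text.ToString();
        }
    }
}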

-
You probably shouldn't call your class `PdfTextExtractor` as it will clash with the one in `iTextSharp.text.pdf.parser` – Neil Jun 12 '12 at 15:25
-
iTextSharp has moved to GitHub: http://github.com/itext/itextsharp – Amedee Van Gasse Dec 09 '15 at 15:38
-
Licenced AGPL so that it can be used to create commercial software only if it is also AGPL licenced. If you want to develop commercial, proprietary software you must pay. – Sylwester Santorowski Jul 19 '19 at 08:45
-
iTextSharp has been deprecated and replaced with iText 7: https://github.com/itext/itext7-dotnet – Matthew Feb 27 '20 at 21:13
-
** BEWARE ** If you're writing for a commercial company, this is off the table (cost prohibitive by a factor of 10 vs. alternative products). AGPL3 is, in all practicality NOT open source unless your project and ALL its consumers is too. It is a tool for companies who use it to appear open source at the start, and make a huge profit directly off the software anyway. Not that people don't deserve to be paid for working, of course they do if they want to! But this is bait and switch. so BEWARE – FastAl Nov 23 '22 at 19:24
-
@FastAl I'm not sure if this problem can be solved using iText 4, but AGPL license was introduced with version 5. Previous versions were available under LGPL and MPL. So a version of iText is available for free (under a convenient license), although it might be somewhat outdated. – jahu Feb 03 '23 at 10:16
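For readers following the iText 7 pointer in the comments above, the same page-by-page extraction might look roughly like this. It is a minimal sketch, assuming the iText 7 .NET package is referenced. The names `IText7TextSketch` and `ExtractText` are made up for illustration. The AGPL terms discussed above apply to iText 7 as well.

```csharp
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Hypothetical helper class; not part of iText itself.
public static class IText7TextSketch
{
    public static string ExtractText(string path)
    {
        StringBuilder text = new StringBuilder();
        PdfDocument document = new PdfDocument(new PdfReader(path));
        try
        {
            // Page numbers are 1-based, as in iTextSharp.
            for (int page = 1; page <= document.GetNumberOfPages(); page++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(document.GetPage(page)));
            }
        }
        finally
        {
            // Closing the document also closes the underlying reader.
            document.Close();
        }
        return text.ToString();
    }
}
```

Calling `IText7TextSketch.ExtractText(path)` with the path of a PDF returns the concatenated text of all pages.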
iTextSharp is the best bet. I used it to build a spider for Lucene.Net so that it could crawl PDFs.
using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;
namespace Spider.Utils
{
/// <summary>
/// Parses a PDF file and extracts the text from it.
/// </summary>
public class PDFParser
{
/// BT = Beginning of a text object operator
/// ET = End of a text object operator
/// Td move to the start of next line
/// 5 Ts = superscript
/// -5 Ts = subscript
#region Fields
#region _numberOfCharsToKeep
/// <summary>
/// The number of characters to keep, when extracting text.
/// </summary>
private static int _numberOfCharsToKeep = 15;
#endregion
#endregion
#region ExtractText
/// <summary>
/// Extracts the text from a PDF file and writes it to a UTF-8 text file.
/// </summary>
/// <param name="inFileName">the full path to the pdf file.</param>
/// <param name="outFileName">the output file name.</param>
/// <returns>true if the text was extracted and written; false if any error occurred</returns>
public bool ExtractText(string inFileName, string outFileName)
{
StreamWriter outFile = null;
try
{
// Create a reader for the given PDF file
PdfReader reader = new PdfReader(inFileName);
//outFile = File.CreateText(outFileName);
outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);
Console.Write("Processing: ");
int totalLen = 68;
float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
int totalWritten = 0;
float curUnit = 0;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");
// Write the progress.
if (charUnit >= 1.0f)
{
for (int i = 0; i < (int)charUnit; i++)
{
Console.Write("#");
totalWritten++;
}
}
else
{
curUnit += charUnit;
if (curUnit >= 1.0f)
{
for (int i = 0; i < (int)curUnit; i++)
{
Console.Write("#");
totalWritten++;
}
curUnit = 0;
}
}
}
if (totalWritten < totalLen)
{
for (int i = 0; i < (totalLen - totalWritten); i++)
{
Console.Write("#");
}
}
return true;
}
catch
{
return false;
}
finally
{
if (outFile != null) outFile.Close();
}
}
#endregion
#region ExtractTextFromPDFBytes
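/// <summary>
/// This method processes an uncompressed Adobe (text) object
/// and extracts text.
/// </summary>
/// <param name="input">the uncompressed page content bytes</param>
/// <returns>the extracted text, or an empty string if the input is empty or parsing fails</returns>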
public string ExtractTextFromPDFBytes(byte[] input)
{
if (input == null || input.Length == 0) return "";
try
{
string resultString = "";
// Flag showing whether we are currently inside a text object
bool inTextObject = false;
// Flag showing if the next character is literal
// e.g. '\\' to get a '\' character or '\(' to get '('
bool nextLiteral = false;
// () Bracket nesting level. Text appears inside ()
int bracketDepth = 0;
// Keep the most recent characters so that operators (BT, ET, Td, ...) can be detected:
char[] previousCharacters = new char[_numberOfCharsToKeep];
for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
for (int i = 0; i < input.Length; i++)
{
char c = (char)input[i];
if (input[i] == 213)
    c = '\'';
if (inTextObject)
{
// Position the text
if (bracketDepth == 0)
{
if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
{
resultString += "\r\n";
}
else
{
if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
{
resultString += "\n";
}
else
{
if (CheckToken(new string[] { "Tj" }, previousCharacters))
{
resultString += " ";
}
}
}
}
// End of a text object, also go to a new line.
if (bracketDepth == 0 &&
CheckToken(new string[] { "ET" }, previousCharacters))
{
inTextObject = false;
resultString += " ";
}
else
{
// Start outputting text
if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
{
bracketDepth = 1;
}
else
{
// Stop outputting text
if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
{
bracketDepth = 0;
}
else
{
// Just a normal text character:
if (bracketDepth == 1)
{
// Only print out next character no matter what.
// Do not interpret.
if (c == '\\' && !nextLiteral)
{
resultString += c.ToString();
nextLiteral = true;
}
else
{
if (((c >= ' ') && (c <= '~')) ||
((c >= 128) && (c < 255)))
{
resultString += c.ToString();
}
nextLiteral = false;
}
}
}
}
}
}
// Store the recent characters for
// when we have to go back for a checking
for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
{
previousCharacters[j] = previousCharacters[j + 1];
}
previousCharacters[_numberOfCharsToKeep - 1] = c;
// Start of a text object
if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
{
inTextObject = true;
}
}
return CleanupContent(resultString);
}
catch
{
return "";
}
}
private string CleanupContent(string text)
{
string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221"};
string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };
for (int i = 0; i < patterns.Length; i++)
{
string regExPattern = patterns[i];
Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
text = regex.Replace(text, replace[i]);
}
return text;
}
#endregion
#region CheckToken
/// <summary>
/// Checks whether one of the given operator tokens (e.g. BT, T* or ')
/// has just been read, delimited by whitespace on both sides.
/// </summary>
/// <param name="tokens">the tokens to search for</param>
/// <param name="recent">the recent character array</param>
/// <returns>true if any token matches the end of the buffer</returns>
private bool CheckToken(string[] tokens, char[] recent)
{
    int last = _numberOfCharsToKeep - 1;
    foreach (string token in tokens)
    {
        // The token must sit directly before the trailing delimiter.
        // Using token.Length avoids indexing past one-character tokens such as '.
        int start = last - token.Length;
        if (start < 1 || !IsDelimiter(recent[last]) || !IsDelimiter(recent[start - 1]))
            continue;
        bool match = true;
        for (int k = 0; k < token.Length && match; k++)
            match = recent[start + k] == token[k];
        if (match) return true;
    }
    return false;
}
// Space, carriage return and line feed separate operators in a content stream.
private static bool IsDelimiter(char c)
{
    return c == ' ' || c == 0x0d || c == 0x0a;
}
#endregion
}
}
-
Hello ceetheman, I tried the code you provided above, but I ran into a problem: some of my PDF files are read properly, but for others I get an "Index Out of Range" error in the function "CheckToken". Can you please help me resolve this? – Radhi Feb 22 '10 at 12:38
-
Referencing the source of your example is a good & polite idea. In this case the same source code can be found here http://www.codeproject.com/KB/cs/PDFToText.aspx – Myster Apr 29 '10 at 00:22
-
I have problems with this code: it returns gobbledygook made up of the letters r and n. I used PDFBox in the end. – Myster Apr 29 '10 at 03:39
-
So weird... I plugged in my pdf and I got 1627 empty lines in my text file... – Ortund Oct 05 '17 at 11:18
-
The answer provided by Brock Nusser looks like the most up-to-date solution and should be considered as being the right answer for this question. – ceetheman Jan 11 '18 at 15:38
-
Hello! I really like this solution and works perfectly smooth on English PDF documents. However using french PDF documents I get a lot of "\036" and "\037" on the text file. I noticed that you use the CleanupContent to clean your document. But using the same logic, the "\037" can be an "F" or " " so I'm a little bit lost. Can you please explain more the usage of CleanupContent? Thank you. – Oussama melki Sep 07 '19 at 20:34
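On the "\036" and "\037" question: inside a PDF literal string, a backslash followed by up to three octal digits stands for a single byte, and `CleanupContent` only replaces the specific escapes listed in its table. A more general approach could decode every octal escape in one pass. The sketch below assumes the font uses a Latin-1 compatible encoding. The names `PdfEscapeSketch` and `DecodeOctalEscapes` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of iTextSharp or the parser above.
public static class PdfEscapeSketch
{
    // Matches a backslash followed by one to three octal digits, e.g. \037 or \351.
    private static readonly Regex OctalEscape = new Regex(@"\\([0-7]{1,3})");

    // Replaces each octal escape with the character that has that code.
    // Assumption: the byte value can be used directly as the character code,
    // which only holds for fonts with a Latin-1 compatible encoding.
    public static string DecodeOctalEscapes(string text)
    {
        return OctalEscape.Replace(text, m =>
            ((char)Convert.ToInt32(m.Groups[1].Value, 8)).ToString());
    }
}
```

For example, `PdfEscapeSketch.DecodeOctalEscapes(@"caf\351")` yields "café", matching the `\351` entry in the table above. A code such as `\037` (decimal 31) has no printable meaning in Latin-1, so its glyph is defined by the font's own encoding in that particular PDF. That is why a fixed replacement table cannot resolve it reliably.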
PDFClown might help, but I would not recommend it for a big or heavy use application.

-
Licenced LGPL so that it can be used to create commercial, proprietary software. – Sylwester Santorowski Jul 19 '19 at 08:43
// Requires: using System.Data; using System.Text;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public string ReadPdfFile(object Filename, DataTable ReadLibray)
{
    // Open the document once and reuse the reader for every page.
    PdfReader reader = new PdfReader((string)Filename);
    StringBuilder text = new StringBuilder();
    try
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // A fresh strategy per page keeps text from earlier pages out of the result.
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            // Round-trip the text from the system default code page to UTF-8.
            s = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            text.Append(s);
        }
    }
    finally
    {
        reader.Close();
    }
    return text.ToString();
}

iText is the best library I know. Originally written in Java, there is a .NET port as well.
-
That is not an official port, and the link is broken anyway. The official .NET port of iText, iTextSharp, can be found on GitHub: http://github.com/itext/itextsharp – Amedee Van Gasse Dec 09 '15 at 15:39
You could look into this: http://www.codeproject.com/KB/showcase/pdfrasterizer.aspx It's not completely free, but it looks very nice.
Alex

-
Can this help to convert PDF to raw text? It seems that the tool converts it into an image, so I would need an OCR library then :-) – JRoppert Sep 17 '08 at 13:33
http://www.c-sharpcorner.com/UploadFile/psingh/PDFFileGenerator12062005235236PM/PDFFileGenerator.aspx is open source and may be a good starting point for you.

Have a look at Docotic.Pdf library. It does not require you to make source code of your application open (like iTextSharp with viral AGPL 3 license, for example).
Docotic.Pdf can be used to read PDF files and extract text with or without formatting. Please have a look at the article that shows how to extract text from PDFs.
Disclaimer: I work for Bit Miracle, vendor of the library.

There is also LibHaru

-
Also: "At this moment libHaru does not support reading and editing existing PDF files and it's unlikely this support will ever appear." Is this actually relevant? – TernaryTopiary May 08 '17 at 06:38