How can I read PDF content with iTextSharp using the PdfReader class? My PDF may include plain text or images of text.
iTextSharp is now called "iText 7 for .NET" or "itext7-dotnet" on GitHub: [link](https://github.com/itext/itext7-dotnet). It's recommended to add itext7 with NuGet to your solution. – Peter Huber Aug 29 '20 at 04:43
6 Answers
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using System.IO;
    using System.Text; // StringBuilder, Encoding

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();
        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);
            // Page numbers are 1-based
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

Any particular reason the pdfReader.Close(); happens inside the for loop? – Th 00 mÄ s Nov 16 '12 at 09:53
Also, `ASCIIEncoding.Convert` should be `Encoding.Convert` as it is a static method – Sebastian Jul 23 '13 at 09:21
LGPL / FOSS iTextSharp 4.x
    var pdfReader = new PdfReader(path); // a file stream etc. also works
    byte[] pageContent = pdfReader.GetPageContent(pageNum); // pageNum is 1-based, not zero-based
    byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
    string textFromPage = Encoding.UTF8.GetString(utf8);
None of the other answers were useful to me; they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to `SimpleTextExtractionStrategy` or `LocationTextExtractionStrategy` in the FOSS version.
Something else that might be very useful in conjunction with this:
    // Requires: System.Collections.Generic, System.Linq, System.Text.RegularExpressions
    const string PdfTableFormat = @"\(.*\)Tj";
    Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);

    List<string> ExtractPdfContent(string rawPdfContent)
    {
        var matches = PdfTableRegex.Matches(rawPdfContent);
        var list = matches.Cast<Match>()
            .Select(m => m.Value
                .Substring(1) // remove leading (
                .Remove(m.Value.Length - 4) // remove trailing )Tj
                .Replace(@"\)", ")") // unescape parens
                .Replace(@"\(", "(")
                .Trim()
            )
            .ToList();
        return list;
    }
This will extract the text-only data from the PDF. If the text displayed is `Foo(bar)`, it will be encoded in the PDF as `(Foo\(bar\))Tj`, and this method would return `Foo(bar)` as expected. This method will strip out lots of additional information, such as location coordinates, from the raw PDF content.

You are right, before 5.x.x text extraction was present in iText merely as proof-of-concept and in iTextSharp not at all. That being said, the code you present only works in very primitively built PDFs (using fonts with an ASCII'ish encoding and **Tj** as only text drawing operator). It may be usable in very controlled environments (in which you can ensure to only get such primitive PDFs) but not in general. – mkl Nov 04 '14 at 16:55
Here is a VB.NET solution based on ShravankumarKumar's solution.
This will ONLY give you the text. The images are a different story.
    Public Shared Function GetTextFromPDF(PdfFileName As String) As String
        Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
        Dim sOut = ""
        ' Page numbers are 1-based
        For i = 1 To oReader.NumberOfPages
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
            sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
        Next
        oReader.Close()
        Return sOut
    End Function

When I try this on my PDF, it gives me the error message, "Value cannot be null. Parameter name: value". Any idea what this is about? – Avi Sep 01 '11 at 19:38
sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its). Also, I figured something out about this error. If I take it out of the loop and parse the individual pages, it works on one page and not the other. The only difference between the two that I can tell is that the problematic page has images on it (which I don't need). – Avi Sep 01 '11 at 19:53
I'm using .Net 4.0 and itextsharp 5.1.2.0 (Just downloaded). Same with you? – Carter Medlin Sep 01 '11 at 19:58
In my case, I just wanted the text from a specific area of the PDF document so I used a rectangle around the area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools so when it came time to narrow down the rectangle to the specific location I took a few guesses at the coordinates until the area was found.
    // Entire page. The PDF coordinate system has 0,0 at the bottom-left corner; 72 points per inch.
    Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f);
    RenderFilter _filter = new RegionTextRenderFilter(_pdfRect);
    ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _filter);
    string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy);
As noted by the above comments the resulting text doesn't maintain any of the formatting found in the PDF document, however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values that I required.
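Rather than guessing coordinates, the rectangle can be computed from measurements taken on a printout. The sketch below is only arithmetic and does not touch iTextSharp; the `PdfRegion` class and `FromTopLeftInches` method are hypothetical names, and it assumes a page measured in points (72 per inch) with the origin at the bottom-left corner, as in the sample above.

```csharp
using System;

public static class PdfRegion
{
    public const float PointsPerInch = 72f;

    // Converts a region measured in inches from the TOP-LEFT of the page
    // into PDF coordinates: { llx, lly, urx, ury } in points, origin bottom-left.
    public static float[] FromTopLeftInches(
        float pageHeightPts, float leftIn, float topIn, float widthIn, float heightIn)
    {
        float llx = leftIn * PointsPerInch;
        float ury = pageHeightPts - topIn * PointsPerInch; // flip the y axis
        float urx = llx + widthIn * PointsPerInch;
        float lly = ury - heightIn * PointsPerInch;
        return new[] { llx, lly, urx, ury };
    }

    public static void Main()
    {
        // A 3 x 2 inch box, 1 inch from the left and top edges of a 612 x 792 pt page
        float[] r = FromTopLeftInches(792f, 1f, 1f, 3f, 2f);
        Console.WriteLine(string.Join(", ", r)); // prints: 72, 576, 288, 720
    }
}
```

The four values are in the same order as the `Rectangle(llx, lly, urx, ury)` arguments used in the sample.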
Here is an improved version of ShravankumarKumar's answer. I created special classes for the pages so you can access words in the PDF based on the text rows and the words in each row.
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using System.Collections.Generic;
    using System.Linq;

    //create a list of pdf pages
    var pages = new List<PdfPage>();

    //load the pdf into the reader. NOTE: path can also be replaced with a byte array
    using (PdfReader reader = new PdfReader(path))
    {
        //loop all the pages and extract the text
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            pages.Add(new PdfPage()
            {
                content = PdfTextExtractor.GetTextFromPage(reader, i)
            });
        }
    }

    //use linq to create the rows and words by splitting on newline and space
    pages.ForEach(x => x.rows = x.content.Split('\n').Select(y =>
        new PdfRow() {
            content = y,
            words = y.Split(' ').ToList()
        }
    ).ToList());
The custom classes:

    class PdfPage
    {
        public string content { get; set; }
        public List<PdfRow> rows { get; set; }
    }

    class PdfRow
    {
        public string content { get; set; }
        public List<string> words { get; set; }
    }
Now you can get a word by row and word index.

    string myWord = pages[0].rows[12].words[4];

Or use Linq to find the rows containing a specific word.

    //find the rows in a specific page containing a word
    var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();

    //find the rows in all pages containing a word
    var myRowsAllPages = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();
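One thing to watch with the row/word split: `Split(' ')` yields empty entries wherever the extracted text contains consecutive spaces, which shifts word indexes. The sketch below is a possible variant, not part of the answer above; `RowSplitDemo` and `SplitRows` are hypothetical names, and it works on a plain string so it runs without iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowSplitDemo
{
    // Splits page text into rows, and each row into words,
    // dropping the empty entries produced by runs of spaces.
    public static List<List<string>> SplitRows(string pageText)
    {
        return pageText
            .Split('\n')
            .Select(row => row
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .ToList())
            .ToList();
    }

    public static void Main()
    {
        // Made-up sample text with a double space in the first row
        string sample = "Invoice  No 42\nTotal 19.99 EUR";
        List<List<string>> rows = SplitRows(sample);
        Console.WriteLine(rows[0].Count); // prints: 3
        Console.WriteLine(rows[1][1]);    // prints: 19.99
    }
}
```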

    ' Writes the contents of a text file into a new PDF document.
    Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
        Using sr As New StreamReader(sTxtfile)
            Dim doc As New Document()
            PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
            doc.Open()
            doc.Add(New Paragraph(sr.ReadToEnd()))
            doc.Close()
        End Using
    End Sub