I had been using this piece of code until today, and it was working fine:
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    var cpage = reader.GetPageN(page);
    var content = cpage.Get(PdfName.CONTENTS);
    var ir = (PRIndirectReference)content;
    var value = reader.GetPdfObject(ir.Number);
    if (value.IsStream())
    {
        PRStream stream = (PRStream)value;
        var streamBytes = PdfReader.GetStreamBytes(stream);
        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));
        try
        {
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                {
                    string strs = tokenizer.StringValue;
                    if (!(br = excludeList.Any(st => strs.Contains(st))))
                    {
                        //strfor += tokenizer.StringValue;
                        if (!string.IsNullOrWhiteSpace(strs) &&
                            !stringsList.Any(i => i == strs && excludeHeaders.Contains(strs)))
                            stringsList.Add(strs);
                    }
                }
            }
        }
        finally
        {
            tokenizer.Close();
        }
    }
}
But today I got an exception for one PDF file: Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRIndirectReference'.
While debugging, I found that the exception is thrown by the line var ir = (PRIndirectReference)content; because, for this PDF, the content I'm extracting comes back in the form of an ArrayList (a PdfArray, according to the exception) rather than a single indirect reference, as the debugger shows.
I would be really grateful if anyone could help me with this. Thanks in advance.
EDIT:
The PDF contains paragraphs, tables, headers and footers, and, in a few cases, images. I'm not concerned about the images, since I'm skipping them.
As you can see from the code, I'm trying to add the words to a list of strings, so the output I expect is plain text (individual words, to be specific).
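
For reference, here is a minimal sketch of one way the page loop could avoid the failing cast. It assumes that PdfReader.GetPageContent(int) is available in the iTextSharp version in use; that method returns the decoded bytes of a page's content and is meant to handle both a single stream and an array of streams. The class and method names (PageStrings, Extract) are hypothetical, and the tokenizer calls simply mirror the ones in the code above, so they may need adjusting for newer iTextSharp releases.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class PageStrings
{
    // Returns the raw string tokens found in the content of one page.
    public static List<string> Extract(PdfReader reader, int page)
    {
        var result = new List<string>();

        // Assumption: GetPageContent concatenates the page's content streams,
        // whether /Contents is a single stream or an array of streams,
        // so no cast to PRIndirectReference is needed.
        byte[] pageBytes = reader.GetPageContent(page);

        // Same tokenizer usage as in the question's code.
        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(pageBytes));
        try
        {
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                    result.Add(tokenizer.StringValue);
            }
        }
        finally
        {
            tokenizer.Close();
        }
        return result;
    }
}
```

With something like this, the body of the for loop would reduce to calling PageStrings.Extract(reader, page) and applying the excludeList and excludeHeaders checks to each returned string before adding it to stringsList.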