I had been using this piece of code until today, and it was working fine:

for (int page = 1; page <= reader.NumberOfPages; page++)
{
    var cpage = reader.GetPageN(page);
    var content = cpage.Get(PdfName.CONTENTS);

    var ir = (PRIndirectReference)content;

    var value = reader.GetPdfObject(ir.Number);

    if (value.IsStream())
    {
        PRStream stream = (PRStream)value;

        var streamBytes = PdfReader.GetStreamBytes(stream);

        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

        try
        {
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                {
                    string strs = tokenizer.StringValue;

                    if (!(br = excludeList.Any(st => strs.Contains(st))))
                    {
                        //strfor += tokenizer.StringValue;

                        if (!string.IsNullOrWhiteSpace(strs) &&
                            !stringsList.Any(i => i == strs && excludeHeaders.Contains(strs)))
                            stringsList.Add(strs);
                    }
                }
            }
        }
        finally
        {
            tokenizer.Close();
        }
    }
}

But today I got an exception for one PDF file: Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRIndirectReference'.

While debugging, I found that the error is thrown by the line var ir = (PRIndirectReference)content; because the page content I'm extracting comes back in the form of an ArrayList, as you can see in the image below:

[image: content]

I would be really grateful if anyone could help me with this. Thanks in advance.

EDIT :

The PDF contains paragraphs, tables, headers and footers, and in a few cases images. I'm not concerned about the images, as I'm skipping them.

As you can see from the code, I'm trying to add the words to a string list, so I expect the output to be plain text; words, to be specific.

StackUseR
  • Essentially the (meanwhile deleted) [answer](https://stackoverflow.com/a/68063163/1729265) by Akshay Gaonkar gave the correct hint: Your existing code ignores that the page contents may either be a single (always indirect) stream or a (direct or indirect) array of (always indirect) streams. Simply **check the type** of `content` and treat arrays differently than streams. Furthermore, why don't you simply use the iText text extraction frameworks which takes care of details? – mkl Jun 21 '21 at 09:37
  • @mkl: I agree with what Akshay said! I did check the content type. But what I'm unable to do is, when I parse through the array the result I get in the string list is something like this: `"De","s","k","t","op"," ","P","C"," ","wi","t","h","i","n"," ","t","h","e"," ",`... unfortunately I'm unable to solve this. – StackUseR Jun 21 '21 at 10:28
  • use `PdfTextExtractor`. https://stackoverflow.com/a/5003230/4018180 – Akshay G Jun 21 '21 at 11:26
  • Hi Akshay! Again, thanks for replying. I tried that solution, but I'm getting the same result. I have tried many of the solutions on Stack Overflow, but none of them seem to help. I also tried to replicate the PDF by copy-pasting its contents into Word and creating a new one, but I guess the PDF itself is not the problem; I'm unable to figure it out. – StackUseR Jun 21 '21 at 12:16
  • *'But what I'm unable to do is, when I parse through the array the result I get in the string list is something like this: `"De","s","k","t","op"," ","P","C"," ","wi","t","h","i","n"," ","t","h","e"," ",...` unfortunately I'm unable to solve this.'* - Your code extracts the strings from the page content streams. If you get such small strings, that means the text is drawn in such small pieces. Your assumption that each of those strings represents a word only holds for documents in which each word is drawn as a whole by itself. This is not very common. – mkl Jun 21 '21 at 18:12
  • Furthermore, why don't you simply use the iText text extraction frameworks which takes care of details? – mkl Jun 21 '21 at 18:13
  • @mkl: OK. I guess the iText version that I'm using is not the latest, maybe 5.x or probably 4.x. I will check on that. `why don't you simply use the iText text extraction frameworks which takes care of details?` - I'm really sorry, I don't know what exactly that means. if you can point me into right direction on how to get each word as a whole, that would be really thankful. – StackUseR Jun 22 '21 at 05:10
  • @AshishSrivastava The solution I shared returns a concatenated string, not a list of strings. If you only need a list of words, split the extracted text on whitespace. – Akshay G Jun 22 '21 at 05:57
  • @AkshayGaonkar: Yes, it does, but as I commented above, it returns partial strings and not whole words. Also, my PDF contains at least 15k words, but with your solution, even if I work on the strings it returns, I get only 5k words. So it's partial. – StackUseR Jun 22 '21 at 06:33
  • *"if you can point me into right direction on how to get each word as a whole, that would be really thankful."* - The text extraction framework was added to iText in the early 5.0.x versions. Thus, if you are indeed using 4.x, it is obvious why you cannot use it. If you are using 5.x, look for the `PdfTextExtractor` class. – mkl Jun 22 '21 at 09:08
  • @mkl: Yes, it is a 4.x version, so I understand now why I could not find `PdfTextExtractor`. – StackUseR Jun 22 '21 at 09:10
  • I already pointed to the `PdfTextExtractor` in the comment 22 hours ago and you said you tried it. – Akshay G Jun 22 '21 at 10:13
  • I couldn't work out the `PdfTextExtractor` suggestion because I truly forgot about the version! I searched for `PdfTextExtractor` in my version of iText but couldn't figure out why I wasn't finding it! Also, there was no accepted answer on that thread. There was an LGPL 4.x version which I thought might be useful, but no luck. Yes, I'm in desperate need of support on this, and probably a solution! – StackUseR Jun 22 '21 at 10:37
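
The type check suggested in the comments could look roughly like the sketch below. It is only an illustration, not the commenters' code: `PageContentSketch`, `ReadPageContent` and `Append` are hypothetical names, and it assumes that `PdfArray.ArrayList` and the static `PdfReader.GetPdfObject(PdfObject)` behave the same way in your iTextSharp version (4.x or 5.x).

```csharp
using System.IO;
using iTextSharp.text.pdf;

static class PageContentSketch
{
    // Hypothetical helper (not part of iTextSharp): collects the decoded bytes
    // of a page's /Contents entry, whether it is one stream or an array of streams.
    public static byte[] ReadPageContent(PdfReader reader, int pageNumber)
    {
        PdfDictionary page = reader.GetPageN(pageNumber);

        // Resolves an indirect reference; a direct object is returned unchanged.
        PdfObject contents = PdfReader.GetPdfObject(page.Get(PdfName.CONTENTS));

        var buffer = new MemoryStream();
        if (contents == null)
            return buffer.ToArray(); // page without any content

        if (contents.IsStream())
        {
            Append(buffer, (PRStream)contents);
        }
        else if (contents.IsArray())
        {
            // Each array element refers to one content stream.
            foreach (PdfObject element in ((PdfArray)contents).ArrayList)
            {
                PdfObject item = PdfReader.GetPdfObject(element);
                if (item != null && item.IsStream())
                    Append(buffer, (PRStream)item);
            }
        }
        return buffer.ToArray();
    }

    private static void Append(MemoryStream buffer, PRStream stream)
    {
        byte[] bytes = PdfReader.GetStreamBytes(stream);
        buffer.Write(bytes, 0, bytes.Length);
        buffer.WriteByte((byte)'\n'); // keep tokens of adjacent streams apart
    }
}
```

The returned bytes can be passed to `new PRTokeniser(new RandomAccessFileOrArray(bytes))` exactly as in the question's loop, which removes the cast to `PRIndirectReference`.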

1 Answer

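That was really easy! I don't know why I couldn't figure it out. reader.GetPageContent(page) returns the page content as a byte array regardless of whether the page's contents are a single stream or an array of streams, so no cast is needed: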

using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf;

PdfReader reader = new PdfReader(name);
List<string> stringsList = new List<string>();

for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // get the page content bytes directly; this works whether the page's
    // contents are a single stream or an array of streams
    var streamByte = reader.GetPageContent(page);
    var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamByte));
    var sb = new StringBuilder(); // collects all string tokens of this page

    try
    {
        while (tokenizer.NextToken())
        {
            if (tokenizer.TokenType == PRTokeniser.TK_STRING)
            {
                // append the fragment as-is; fragments are joined per page
                sb.Append(tokenizer.StringValue);
            }
        }
    }
    finally
    {
        // add the concatenated strings of this page to the list
        stringsList.Add(sb.ToString());

        tokenizer.Close();
    }
}

reader.Close();
StackUseR
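
Since the question asks for words, the per-page string built above still has to be split. Below is a minimal sketch, assuming the content stream contains explicit space characters, as in the fragment list quoted in the comments. `WordSplitSketch` and `ToWords` are hypothetical names. Many PDFs position words without any space characters, in which case this approach cannot recover word boundaries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordSplitSketch
{
    // Splits the concatenated text of one page into words on any whitespace.
    public static List<string> ToWords(string pageText)
    {
        return pageText
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }

    static void Main()
    {
        // The fragments reported in the comments, in drawing order.
        string[] fragments =
        {
            "De", "s", "k", "t", "op", " ", "P", "C", " ",
            "wi", "t", "h", "i", "n", " ", "t", "h", "e", " "
        };

        // Joining first and splitting afterwards restores whole words.
        string pageText = string.Concat(fragments);
        foreach (string word in ToWords(pageText))
            Console.WriteLine(word); // Desktop, PC, within, the (one per line)
    }
}
```

Applied to the answer's result, this would be something like `stringsList.SelectMany(WordSplitSketch.ToWords)`.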