C# Extract text from PDF using PdfSharp

Question

Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license.

Just wondering, why downvotes? (There are no clarifying comments to help author to improve the question.) — TN., Dec 11 '12 at 07:28
You need to extract the ToUnicode CMaps from the document to convert the binary indexes of the text-strings, unless you're lucky and the binary indexes are ASCII values themselves. — R.J. Dunnill, Jun 03 '19 at 03:56

score 55 · Answer 1 · answered Jun 04 '14 at 19:37

Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.

using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

public static class PdfSharpExtensions
{
    // Lazily yields every string operand of the page's Tj/TJ operators.
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {
        var content = ContentReader.ReadContent(page);
        return content.ExtractText();
    }

    // Walks the content tree recursively; only operands of Tj and TJ are visited.
    public static IEnumerable<string> ExtractText(this CObject cObject)
    {
        if (cObject is COperator cOperator)
        {
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;
            }
        }
        else if (cObject is CSequence cSequence)
        {
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString cString)
        {
            // The string as stored in the content stream; it is not mapped
            // through the font's encoding, so non-ASCII text may look garbled.
            yield return cString.Value;
        }
    }
}

answered Jun 04 '14 at 19:37 by Ronnie Overby

I am using the PDFsharp library, but it says the ContentReader class is out of context. What could be the problem? – Sudarshan Taparia Aug 31 '16 at 13:33
ContentReader Class is out of context. – Ronnie Overby Sep 01 '16 at 20:42
Couldn't resist. IDK what that means or how to fix it. I try to avoid working with PDFs like the plague because the tools to work with them are crap, and pretending that a human-readable format is machine-readable is a total fool's errand. – Ronnie Overby Sep 01 '16 at 20:43
PdfSharp v1.32.3057 has a bug where `ContentReader.ReadContent` hangs. To fix, there are some changes needed (see [here](http://forum.pdfsharp.net/viewtopic.php?p=7911#p7911)). After fixing the bug, I can confirm this works. :-) – Nicholas Miller Apr 05 '17 at 15:33
Namespace for `ContentReader` : `PdfSharp.Pdf.Content.ContentReader`. – NoOne Jul 09 '18 at 15:11
Although this is promising, it does not work for Unicode texts. – NoOne Jul 09 '18 at 15:52
It seems that it works fine when `OpCode.Name == "Tj"` (which, I guess, is related to ASCII) and returns gibberish when `OpCode.Name == "TJ"` (which, I guess, is Unicode). – NoOne Jul 09 '18 at 16:08
TJ allows glyph-spacing, and to that end, will have integer values between its strings. Neither Tj nor TJ are related to ASCII: both use binary indexes which cannot be depended on to be ASCII. – R.J. Dunnill Jun 03 '19 at 03:54
This works OOTB copy/paste for me in Jan 2021 against PDFs we create from our AS400 / i5. Many Thanks – bkwdesign Jan 29 '21 at 18:04
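
To illustrate the comment above about TJ carrying numbers between its strings, here is a minimal sketch of joining one TJ array back into text. It is only an illustration under stated assumptions: the array elements are passed in as plain `object` values (strings for text, `int` or `double` for the spacing adjustments) so that the sketch does not depend on PdfSharp types, and `TjArraySketch`, `Join`, and `gapThreshold` are names invented for this example. The threshold of 200 is a guess that has to be tuned per document. It does nothing about the encoding problem, so the strings may still need a ToUnicode lookup.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TjArraySketch
{
    // Joins the elements of one TJ operand array into a single string.
    // Strings are text to show. Numbers are position adjustments in
    // thousandths of a text-space unit; a negative value widens the gap
    // before the next string (horizontal writing mode).
    // gapThreshold is a hypothetical heuristic, not a value defined by PDF.
    public static string Join(IEnumerable<object> elements, double gapThreshold = 200)
    {
        var sb = new StringBuilder();
        foreach (var element in elements)
        {
            if (element is string text)
            {
                sb.Append(text);
            }
            else if (element is int || element is double)
            {
                double adjustment = Convert.ToDouble(element, CultureInfo.InvariantCulture);
                // Small adjustments are kerning; a large gap is treated as a word break.
                if (-adjustment > gapThreshold)
                    sb.Append(' ');
            }
        }
        return sb.ToString();
    }
}
```

For example, `TjArraySketch.Join(new object[] { "Pre", -12, "dic", 5.5, "tion", -300, "is" })` gives `"Prediction is"`: the small adjustments are ignored and only the large gap becomes a space. Yielding one joined string per TJ operator, instead of one string per fragment, is one way to avoid words being split into pieces.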

score 22 · Answer 2 · edited Jun 01 '22 at 22:00

I implemented it in much the same way as David did. Here is my code:

...
{
    // ....
    var page = document.Pages[1]; // Pages is zero-based, so this is the second page
    CObject content = ContentReader.ReadContent(page);
    var extractedText = ExtractText(content);
    // ...
}

private IEnumerable<string> ExtractText(CObject cObject)
{
    var textList = new List<string>();
    if (cObject is COperator)
    {
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
            {
                textList.AddRange(ExtractText(cOperand));
            }
        }
    }
    else if (cObject is CSequence)
    {
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
        {
            textList.AddRange(ExtractText(element));
        }
    }
    else if (cObject is CString)
    {
        var cString = cObject as CString;
        textList.Add(cString.Value);
    }
    return textList;
}

You shouldn't have stripped down the StringBuilder, PDFs are quite big and that solution will cause a huge unnecessary memory consumption. — Ivan Ičin, Aug 20 '16 at 14:37

score 14 · Answer 3 · answered Aug 01 '13 at 08:36

PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators.

I've uploaded a simple implementation to GitHub.

answered Aug 01 '13 at 08:36 by David Schmitt

On many PDFs CString.Value returns just some junk (e.g. create a PDF using OpenOffice.org and try to import it using this method). – Ivan Ičin Aug 20 '16 at 14:52
No, PdfSharp does not provide all the tools for text extraction. Functionality has yet to be added for ToUnicode CMaps, which are necessary to extract the text of Unicode PDFs. – R.J. Dunnill Jun 03 '19 at 03:59
Because that's the choice I made. – David Schmitt Dec 02 '19 at 11:01
It doesn't seem to be perfect: one word can be split into several pieces, e.g. Pre dic t ion i s ve – hazjack Sep 28 '20 at 08:24
@hazjack Yeah, you'll need a strong AI then to salvage the text from your PDF. – David Schmitt Sep 29 '20 at 09:07

score -1 · Answer 4 · answered Mar 31 '23 at 03:41

Using this method, I recently figured out how to handle what you are calling Unicode. It's not exactly Unicode, though; it's PdfEncoding. Embedded fonts cause the PDF to include difference tables called CMaps. You have to store these tables, look up each PdfEncoding value in the CMap, and substitute the mapped value when you find one. That turned the symbols into readable text for me, after three weeks of learning about PDF file structures. You'll also need SharpZipLib to inflate the CMap tables, as they are compressed.
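
The lookup described here can be sketched roughly as follows. This is only an illustration, not the code this answer is based on. It assumes the ToUnicode CMap stream has already been inflated to text, that it contains only `beginbfchar` sections (no `beginbfrange`), and that the font uses 2-byte character codes. `ToUnicodeSketch`, `ParseBfChars`, and `Decode` are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeSketch
{
    // Reads "<code> <unicode>" pairs from beginbfchar ... endbfchar sections.
    public static Dictionary<int, string> ParseBfChars(string cmapText)
    {
        var map = new Dictionary<int, string>();
        var sections = Regex.Matches(cmapText, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline);
        foreach (Match section in sections)
        {
            var pairs = Regex.Matches(section.Groups[1].Value, @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>");
            foreach (Match pair in pairs)
            {
                int code = int.Parse(pair.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
                map[code] = HexToUtf16(pair.Groups[2].Value);
            }
        }
        return map;
    }

    // The destination is a run of UTF-16BE code units written as hex digits;
    // it may hold more than one unit (for example a ligature such as "fi").
    private static string HexToUtf16(string hex)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
            sb.Append((char)Convert.ToInt32(hex.Substring(i, 4), 16));
        return sb.ToString();
    }

    // Maps the raw bytes of a shown string through the table, two bytes per code.
    // Codes missing from the table become U+FFFD (the replacement character).
    public static string Decode(byte[] raw, Dictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int code = (raw[i] << 8) | raw[i + 1];
            sb.Append(map.TryGetValue(code, out var text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }
}
```

With a table built from `"2 beginbfchar\n<0003> <0020>\n<0024> <0041>\nendbfchar"`, decoding the bytes `00 24 00 03 00 24` gives `"A A"`. A real document needs one such table per font, and the content stream has to be tracked to know which font is active for each Tj/TJ operator.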

answered Mar 31 '23 at 03:41 by Kaelidian

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Viktor Liehr Apr 06 '23 at 13:46
