I'm working on an Azure Function that extracts the text of a PDF file. I receive the PDF as a stream from Azure Blob Storage, and I want to open that stream as a PDF document so I can use the code from this question:

using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

public static class PdfSharpExtensions
{
    // Returns the text fragments found in the page's content stream.
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {
        var content = ContentReader.ReadContent(page);
        return content.ExtractText();
    }

    // Recursively walks a content object and yields the strings
    // that are operands of the Tj and TJ text-showing operators.
    public static IEnumerable<string> ExtractText(this CObject cObject)
    {
        if (cObject is COperator cOperator)
        {
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;
            }
        }
        else if (cObject is CSequence cSequence)
        {
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString cString)
        {
            yield return cString.Value;
        }
    }
}

Is there a way to do it?

Xavi Andreu

1 Answer

As I understand it, you need to open a PDF document from the stream and then read the content from that document.

First, we need to open the PDF from a MemoryStream. Since we only have a Stream, we copy it into a MemoryStream like so:

// Copies everything from input to output in 16 KB chunks.
public static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[16 * 1024];
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, read);
    }
}

// Create MemoryStream
var ms = new MemoryStream();
CopyStream(streamFromDatabase, ms);

// Create PDF from MemoryStream
var pdf = PdfReader.Open(ms);
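
As a side note, the copy step can also be sketched with the built-in `Stream.CopyTo`, assuming the incoming blob stream is readable. The names `StreamHelpers`, `ToSeekableStream` and `blobStream` are hypothetical and not part of the code above. The sketch also rewinds the MemoryStream, which is a harmless precaution for any consumer that reads from the current position.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamHelpers
{
    // Hypothetical helper: copies any readable stream into a seekable MemoryStream.
    public static MemoryStream ToSeekableStream(Stream input)
    {
        var ms = new MemoryStream();
        input.CopyTo(ms);   // built-in equivalent of the manual CopyStream loop
        ms.Position = 0;    // rewind so a consumer starts at the first byte
        return ms;
    }
}

public static class CopyDemo
{
    public static void Main()
    {
        // Stand-in bytes for the blob stream; real code would pass the stream
        // received from Azure Blob Storage instead.
        using (var blobStream = new MemoryStream(Encoding.ASCII.GetBytes("%PDF-1.4 dummy")))
        using (var ms = StreamHelpers.ToSeekableStream(blobStream))
        {
            Console.WriteLine($"{ms.Length} bytes copied, position {ms.Position}");
        }
    }
}
```

The returned `ms` could then be passed to `PdfReader.Open(ms)` exactly as shown above.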

And now we can read the text from it like so:

var sb = new StringBuilder();

foreach (var page in pdf.Pages)
{
    sb.Append(string.Join("", page.ExtractText().ToArray()));
}
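
The `string.Join` call matters because `ExtractText` returns a lazy `IEnumerable<string>`. A minimal sketch of the difference, assuming the sequence is passed straight to `StringBuilder.Append`: that call binds to the `Append(object)` overload and appends the iterator's compiler-generated type name instead of the text. `Fragments` below is a hypothetical stand-in for `page.ExtractText()`, so the sketch runs without PDFsharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class JoinDemo
{
    // Hypothetical stand-in for page.ExtractText(): an iterator yielding text fragments.
    public static IEnumerable<string> Fragments()
    {
        yield return "Hello, ";
        yield return "PDF";
    }

    public static void Main()
    {
        // Append(object) calls ToString() on the iterator, which yields its type name.
        var wrong = new StringBuilder();
        wrong.Append(Fragments());

        // string.Join enumerates the sequence and concatenates the fragments.
        var right = new StringBuilder();
        right.Append(string.Join("", Fragments()));

        Console.WriteLine(wrong);  // a compiler-generated type name, not the text
        Console.WriteLine(right);  // prints "Hello, PDF"
    }
}
```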
MindSwipe
  • I used what you said and the sb returns this `ExtractTextFromPDF.PdfSharpExtensions+d__1` (**ExtractTextFromPDF** is the name of the project, **PdfSharpExtensions** is the name of the class where I have the extension method to extract the text) – Xavi Andreu Oct 09 '19 at 15:05
  • Put a breakpoint on the line `var pdf = PdfReader.Open(ms); ` and inspect `pdf` and `ms`. What are they? – MindSwipe Oct 10 '19 at 05:34
  • `ms` appears to be null and have a read and write timeout. The pdf has the author, creation date and other data right, but I can't find the content. `AcroForm` is null, I don't know if that helps. – Xavi Andreu Oct 10 '19 at 07:38
  • What is `AcroForm`? Also `ms` can't be null, it's probably just empty after converting to a pdf. Is the `pdf` variable correct? Does it have pages and such? – MindSwipe Oct 10 '19 at 07:42
  • `AcroForm` is a property of the PDF document. The page count is correct. I don't know which properties I should look for. – Xavi Andreu Oct 10 '19 at 07:50
  • Don't use the `AcroForm` property; just iterate directly over the `Pages` property and pass each page to `ExtractText`. – MindSwipe Oct 10 '19 at 08:21
  • Sorry, I worded that badly and caused confusion. `AcroForm` is a property, and I only mentioned that it is null; it has nothing inside it and no relation to the pages. I debugged the `pdf` variable and then an individual page, but I can't find any text property or anything similar. – Xavi Andreu Oct 10 '19 at 08:31
  • Should the `Stream` property of pdf.Pages have the text? Currently it is null – Xavi Andreu Oct 10 '19 at 09:58
  • No. I don't know where the text is stored; I only know that `Pages` is an enumerable of `PdfPage`. So we can iterate (`foreach`) over `Pages` and call `ExtractText` on each page (`page.ExtractText()`). What I only just noticed is that `ExtractText` returns an `IEnumerable<string>`, so we need to join the strings together. I updated my answer. – MindSwipe Oct 10 '19 at 10:26
  • That was the problem. It worked lights out. Thank you! – Xavi Andreu Oct 10 '19 at 10:36