I'm using PDF Extractor (from here) to get the text from PDF attachments in emails.

It seems to me that the only way I can extract the text is to save the PDF to a file and then use the code

Private Function ReadPdfToStringList(tempfilename As String) As List(Of String)
    Dim extractedText As String
    Using pdfFile As FileStream = File.OpenRead(tempfilename)
        Using extractor As Extractor = New Extractor()
            extractedText = extractor.ExtractToString(pdfFile)
        End Using
    End Using
    DeleteTempFile()
    Return New List(Of String)(extractedText.Split(Chr(13)))
End Function

to extract a list of Strings from the PDF file.

However, I can't seem to extract text from the attachment directly. The 'extractor' doesn't seem to be able to handle any source other than a file on disk.

Is there any way of tricking the 'extractor' into reading the PDF from memory, maybe by creating an in-memory file stream?

I've tried using a MemoryStream like this:

Private Function ReadPdfMemStrmToStringList(memstream As MemoryStream) As List(Of String)
    Dim extractedText As String
    Using extractor As Extractor = New Extractor()
        extractedText = extractor.ExtractToString(memstream)
    End Using
    Return New List(Of String)(extractedText.Split(Chr(13)))
End Function

but, apparently because the extractor assumes the source is a disk file, it throws an error saying that it can't find a temporary file.

To be honest I've spent quite a bit of time trying to understand memory streams and they don't seem to fit the bill.
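
One thing that may be worth ruling out here is the stream position. After data has been written to a MemoryStream, its Position sits at the end of the data, so any consumer that reads from the current position sees nothing. Whether the extractor reads from the current position has not been confirmed, so treat this as an assumption. The following is a minimal sketch using only System.IO; the module and the CountReadableBytes helper are hypothetical names made up for illustration.

```vbnet
Imports System
Imports System.IO

Module StreamPositionDemo
    ' Hypothetical helper: counts the bytes a consumer could still read
    ' starting from the stream's current position.
    Function CountReadableBytes(source As Stream) As Integer
        Using copy As New MemoryStream()
            source.CopyTo(copy) ' copies from the current position to the end
            Return CInt(copy.Length)
        End Using
    End Function

    Sub Main()
        Dim memstream As New MemoryStream()
        Dim payload As Byte() = {37, 80, 68, 70} ' "%PDF" as ASCII bytes
        memstream.Write(payload, 0, payload.Length)

        ' Position is now at the end of the data, so nothing is left to read.
        Console.WriteLine(CountReadableBytes(memstream)) ' prints 0

        ' Rewinding makes the whole buffer readable again.
        memstream.Position = 0
        Console.WriteLine(CountReadableBytes(memstream)) ' prints 4
    End Sub
End Module
```

If this turns out to be the cause, setting memstream.Position = 0 before calling ExtractToString would be the corresponding change.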

UPDATE

Here is also the code that I'm using to save the attachment to the MemoryStream.

Private Sub SaveAttachmentToMemStrm(msg As MimeMessage)
    Dim memstrm As New MemoryStream
    For Each attachment As MimePart In msg.Attachments
        If attachment.FileName.Contains("booking") Then
            attachment.WriteTo(memstrm)
        End If
    Next
    'this line only adds the memory stream to a List (of MemoryStream)
    attachments.Add(memstrm)
End Sub
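
A possible variation on the code above, offered as a sketch rather than a confirmed fix: in MimeKit, MimePart.WriteTo writes the whole MIME entity (headers plus the transfer-encoded body), whereas Content.DecodeTo writes the decoded attachment bytes. The sketch also gives each attachment its own stream and rewinds it afterwards. The module name, the DecodeAttachments helper, and its nameFragment parameter are hypothetical, and older MimeKit releases expose the content as ContentObject rather than Content.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports MimeKit

Module AttachmentSketch
    ' Hypothetical helper: returns one decoded, rewound stream per matching attachment.
    Function DecodeAttachments(msg As MimeMessage, nameFragment As String) As List(Of MemoryStream)
        Dim result As New List(Of MemoryStream)
        ' OfType skips attachments that are not MimePart (e.g. attached messages).
        For Each part As MimePart In msg.Attachments.OfType(Of MimePart)()
            If part.FileName IsNot Nothing AndAlso part.FileName.Contains(nameFragment) Then
                Dim decoded As New MemoryStream()
                ' Writes the decoded bytes only, without MIME headers or base64 encoding.
                ' Assumption: recent MimeKit; older releases use part.ContentObject.
                part.Content.DecodeTo(decoded)
                decoded.Position = 0 ' rewind so a reader starts at the first byte
                result.Add(decoded)
            End If
        Next
        Return result
    End Function
End Module
```

Each returned stream could then be passed to a function like ReadPdfMemStrmToStringList; whether the extractor then succeeds is untested here.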

Many apologies if I've missed something obvious.

David Wilson
  • If you look at the [source code](https://github.com/poulfoged/pdf-extract/blob/master/source/PdfExtract/TemporaryFile.cs) for the `TemporaryFile` class, it uses the [Path.GetTempPath Method](https://msdn.microsoft.com/en-us/library/system.io.path.gettemppath(v=vs.110).aspx). Is there any reason that `Path.GetTempPath` might fail in the environment you're running it in, e.g. as a Windows service? – Andrew Morton Aug 21 '16 at 15:36
  • @AndrewMorton Sorry, I don't read C# very well. Does it look like the extractor code is creating a temporary file every time I execute it, then? It works just fine when I use an on-disk PDF - I get all the text, but I'm trying to find a way to use an in-memory source. I'm running the code as a regular WinForms program. Sorry if I'm not being very clear. – David Wilson Aug 21 '16 at 16:00
  • It looks like the extractor creates two temporary files for the Xpdf program to use (input and output). I can't see why it won't work with a MemoryStream as the input to the PDF-extract wrapper. What is the *actual* error message? – Andrew Morton Aug 21 '16 at 16:09
  • The error is: An unhandled exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll. Additional information: Could not find file 'C:\Users\Dad\AppData\Local\Temp\d10adc4175f54de2a7cf04bc712e0df6.tmp'. – David Wilson Aug 21 '16 at 16:14
  • Just a thought .. am I implementing the memory stream correctly? I'll add the code that saves the attachment to a stream in a moment. – David Wilson Aug 21 '16 at 16:15
  • @AndrewMorton OK I've added the code that I'm using to add the attachment to a MemoryStream. I've also tried it without the If statement to try adding all the attachments to the MemoryStream – David Wilson Aug 21 '16 at 16:25
  • At this point, I'd look for an alternative PDF parser. Maybe [PdfReader from MemoryStream()](http://stackoverflow.com/q/14939102/1115360) and [Reading PDF content with itextsharp dll in VB.NET or C#](http://stackoverflow.com/a/5003230/1115360) are of use to you. – Andrew Morton Aug 21 '16 at 16:50
  • Yep I think I agree with you there. Thanks for your time though. – David Wilson Aug 21 '16 at 16:51

0 Answers