I'm using PDF Extractor (from here) to get the text from PDF attachments in emails.
It seems to me that the only way I can extract the text is to save the PDF to a file and then use the code
    Private Function ReadPdfToStringList(tempfilename As String) As List(Of String)
        Dim extractedText As String
        Using pdfFile As FileStream = File.OpenRead(tempfilename)
            Using extractor As Extractor = New Extractor()
                extractedText = extractor.ExtractToString(pdfFile)
            End Using
        End Using
        DeleteTempFile()
        Return New List(Of String)(extractedText.Split(Chr(13)))
    End Function
to extract a list of Strings from the PDF file.
However, I can't seem to extract text from the attachment directly. The 'extractor' doesn't seem to be able to handle any source other than a file on disk.
Is there any way of tricking the 'extractor' into opening a file from memory, maybe by creating an in-memory file stream?
I've tried using a MemoryStream like this:
    Private Function ReadPdfMemStrmToStringList(memstream As MemoryStream) As List(Of String)
        Dim extractedText As String
        Using extractor As Extractor = New Extractor()
            extractedText = extractor.ExtractToString(memstream)
        End Using
        Return New List(Of String)(extractedText.Split(Chr(13)))
    End Function
but, apparently because the extractor assumes the source is a disk file, it returns an error saying that it can't find a temporary file.
To be honest I've spent quite a bit of time trying to understand memory streams and they don't seem to fit the bill.
UPDATE
Here also is the code that I'm using to save the attachment to the MemoryStream.
    Private Sub SaveAttachmentToMemStrm(msg As MimeMessage)
        Dim memstrm As New MemoryStream
        For Each attachment As MimePart In msg.Attachments
            If attachment.FileName.Contains("booking") Then
                attachment.WriteTo(memstrm)
            End If
        Next
        'this line only adds the memory stream to a List (of MemoryStream)
        attachments.Add(memstrm)
    End Sub
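One variation I'm unsure about: as far as I understand MimeKit, WriteTo writes the whole MIME part (headers plus the still-encoded body), whereas Content.DecodeTo writes only the decoded bytes, and either way the stream is left positioned at its end. Below is a minimal sketch of that variation, assuming a MimeKit version where MimePart exposes Content (older versions call it ContentObject). The module and function names are hypothetical, and I don't know whether the extractor will then accept the stream.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports MimeKit

Module AttachmentSketch
    ' Hypothetical helper: returns one rewound MemoryStream per matching attachment.
    Public Function DecodeBookingAttachments(msg As MimeMessage) As List(Of MemoryStream)
        Dim result As New List(Of MemoryStream)
        ' OfType skips attachments that are not MimePart (e.g. attached messages).
        For Each part As MimePart In msg.Attachments.OfType(Of MimePart)()
            ' FileName can be Nothing when the sender did not supply one.
            If part.FileName Is Nothing OrElse Not part.FileName.Contains("booking") Then
                Continue For
            End If
            Dim memstrm As New MemoryStream()
            ' Write the decoded attachment bytes rather than the raw MIME entity.
            part.Content.DecodeTo(memstrm)
            ' Rewind so that a reader starts at the first byte, not at the end.
            memstrm.Position = 0
            result.Add(memstrm)
        Next
        Return result
    End Function
End Module
```

Unlike my Sub above, this uses a separate stream for each matching attachment, so two PDFs would not end up concatenated in one stream.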
Many apologies if I've missed something obvious.