
My needs are quite simple: I need a tool or library (a library would be ideal) to convert a PDF file to an HTML file, keeping as much of the information as possible. I don't need images or styles, just the semantic information.

I've looked at iTextPdf, but I haven't found anything in it that does this. Any help would be appreciated.

Thanks in advance

David Conde

1 Answer


Use iTextSharp. It's free and you only need the "itextsharp.dll".

http://sourceforge.net/projects/itextsharp/

Here is a simple function for reading the text out of a PDF.

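Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
    Dim sOut As New System.Text.StringBuilder()

    Try
        ' iTextSharp page numbers start at 1.
        For i As Integer = 1 To oReader.NumberOfPages
            ' Use a fresh strategy for each page so text does not carry over between pages.
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()

            ' AppendLine keeps the last word of one page from running into the first word of the next.
            sOut.AppendLine(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its))
        Next
    Finally
        ' Close the reader even if extraction fails.
        oReader.Close()
    End Try

    Return sOut.ToString()
End Function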
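
The function above only gives you plain text, so getting to HTML is a separate step. Here is a rough sketch of one way to do it, assuming that treating every non-empty line of the extracted text as its own paragraph is good enough for your purposes. `PdfTextToHtml` and `TextToHtml` are hypothetical names of my own, not part of iTextSharp, and this will not recover headings, lists, or tables, since the extracted text no longer carries that structure.

```vbnet
Imports System
Imports System.Net
Imports System.Text

Public Module PdfTextToHtml
    ' Hypothetical helper (not an iTextSharp API): wraps each non-empty line
    ' of already-extracted text in a <p> element and HTML-encodes its content.
    Public Function TextToHtml(extractedText As String) As String
        Dim sb As New StringBuilder()
        sb.AppendLine("<html><body>")

        ' Split on any common line ending and drop empty entries.
        Dim separators As String() = New String() {vbCrLf, vbLf, vbCr}
        For Each rawLine As String In extractedText.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            Dim trimmed As String = rawLine.Trim()

            ' Skip lines that contained only whitespace.
            If trimmed.Length > 0 Then
                ' Encode <, > and & so the text cannot break the markup.
                sb.Append("<p>").Append(WebUtility.HtmlEncode(trimmed)).AppendLine("</p>")
            End If
        Next

        sb.AppendLine("</body></html>")
        Return sb.ToString()
    End Function
End Module
```

You would call it as `TextToHtml(GetTextFromPDF("input.pdf"))`, where the file name is just a placeholder. Keeping the HTML step separate from the extraction means you can swap in a different extraction strategy later without touching the markup code.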
Carter Medlin