I am scraping data from thousands of PDF files. Currently I open each PDF in the Adobe control, press CTRL+A and CTRL+C by hand to copy the text to the clipboard, and then click a button that processes the text and extracts the desired fields.

I would like to skip the manual CTRL+A, CTRL+C step so that I can automate more of the process.

Tips?

Mwiza
user3573562
    Have you thought about using a PDF API that directly reads the PDF file and extracts the text, without needing Adobe Reader? Your question looks like a duplicate of http://stackoverflow.com/questions/2116440/extracting-text-from-pdfs-in-c-sharp - that is for C#, but any library for C# will also work for VB.NET because they both work in the .NET framework. Benefit of a PDF library: you cut out the overhead of starting and closing Adobe Reader every time. – Amedee Van Gasse Mar 05 '16 at 22:09

1 Answer

Amedee - Thanks for the nudge to try iTextSharp again. I had been getting errors and was really frustrated, but now it works perfectly.

For anyone else trying to do the same, here is my test project code:

    Option Explicit On
    Option Strict On

    Imports System.IO 'Working With Files
    Imports System.Text 'Working With Text (StringBuilder lives here)

    'iTextSharp Libraries
    Imports iTextSharp.text 'Core PDF Text Functionalities
    Imports iTextSharp.text.pdf 'PDF Content
    Imports iTextSharp.text.pdf.parser 'Content Parser


    Public Class Form1

        Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load

            Dim strFileName As String
            Dim strText As String
            Dim intPageCount As Integer
            Dim intI As Integer

            Dim strOut As StringBuilder = New StringBuilder()

            strFileName = "E:\2020-Skysight-14288.pdf"
            Label_Filename.Text = strFileName

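            Dim Reader As New PdfReader(strFileName) 'Open the PDF directly; Adobe Reader is not needed

            Try
                intPageCount = Reader.NumberOfPages

                Label_PageCount.Text = intPageCount.ToString & " Pages"

                'Page numbers in iTextSharp are 1-based
                For intI = 1 To intPageCount
                    strText = PdfTextExtractor.GetTextFromPage(Reader, intI)
                    strOut.AppendLine(strText) 'Keep a line break between pages
                Next
            Finally
                Reader.Close() 'Release the file even if extraction fails
            End Try

            RichTextBox1.AppendText(strOut.ToString)

            strText = strOut.ToString 'Full document text, ready for field extraction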

        End Sub

    End Class
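
Since the question involves thousands of files, the same extraction can be wrapped in a loop over a folder. The following is only a rough sketch and not part of the test project above: the module name `PdfBatchExtractor`, the helper `ExtractText`, and the folder `E:\pdfs` are hypothetical names chosen for illustration. Only `PdfReader` and `PdfTextExtractor.GetTextFromPage` are taken from the code above. It assumes a console project that references iTextSharp 5.

```vbnet
Option Explicit On
Option Strict On

Imports System.IO
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module PdfBatchExtractor

    'Hypothetical helper: returns the text of every page in one PDF.
    Function ExtractText(pdfPath As String) As String
        Dim sb As New StringBuilder()
        Dim reader As New PdfReader(pdfPath)
        Try
            For pageNumber As Integer = 1 To reader.NumberOfPages
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, pageNumber))
            Next
        Finally
            reader.Close() 'Release the file even if a page fails to parse
        End Try
        Return sb.ToString()
    End Function

    Sub Main()
        'Hypothetical folder; replace with the real location of the PDFs.
        Dim sourceFolder As String = "E:\pdfs"

        For Each pdfPath As String In Directory.GetFiles(sourceFolder, "*.pdf")
            'System.IO.Path is fully qualified because newer iTextSharp
            'versions also define a Path class in the parser namespace.
            Dim fileName As String = System.IO.Path.GetFileName(pdfPath)
            Try
                Dim content As String = ExtractText(pdfPath)
                'Hand "content" to the existing field-extraction routine here.
                Console.WriteLine("{0}: {1} characters", fileName, content.Length)
            Catch ex As Exception
                'A damaged or encrypted file should not stop the whole run.
                Console.WriteLine("{0}: skipped ({1})", fileName, ex.Message)
            End Try
        Next
    End Sub

End Module
```

Catching the exception per file is a design choice for long unattended runs: one bad PDF is reported and skipped instead of ending the batch.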
user3573562