0

Part 3 of a previous post.

The task: I am attempting to iterate over a series of URLs presented in excel and generate complete text files for each.

So far: The VBA solution in the previous post uses Word to open the PDF Url, and from this generate a text file. This is a much better improvement than using power query. However, I have come across one instance where it fails as it recognises the text on the first page as an image; therefore, this text is omitted in the generated text file.

As an example of a truncated text file generated: https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b is an example where the generated PDF text file has text missing.

As advised, I have downloaded a version of Adobe Acrobat (Adobe Acrobat XI Standard), which enables Adobe Acrobat 10.0 Type Library Reference (and a bunch of others). I hope this will enable more accurate text conversion but will need help to develop this script.

So, for this problem, I wish to modify the following code to swap Word for Abode to perform the conversion. What I am unsure of, however, is whether Adobe can open a PDF from a URL in the same way word can. I think I may need to introduce some steps to download the PDF first in the folder and then for each convert those. Ideally, It would work as the above code but if this is necessary, then so be it.

M Code from the previous answer:

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(url)
            On Error GoTo 0      'stop ignoring errors
            If Not oDoc Is Nothing Then
                filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
                Debug.Print filePath
                'open text stream as unicode
                Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                fileStream.Write oDoc.Range.Text
                fileStream.Close
                oDoc.Close
                c.Interior.Color = vbGreen 'flag OK
            Else
                c.Interior.Color = vbRed   'flag problem
            End If
        End If 'have url
    Next c
    
    oWd.Quit
End Sub

As for developing this script, I have the following from what I have found online, which now successfully performs the conversion using Acrobat for the previously difficult PDF so Good start.

Sub convertpdf2()

    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim Filename As String
    Dim jsObj As Object
    Dim NewFileName As String

    Filename = "C:\temp\name1.pdf"
    NewFileName = "C:\temp\name1.txt"

    Set AcroXApp = CreateObject("AcroExch.App")
    'AcroXApp.Show

    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    AcroXAVDoc.Open Filename, "Acrobat"

    Set AcroXPDDoc = AcroXAVDoc.GetPDDoc


    Set jsObj = AcroXPDDoc.GetJSObject


    jsObj.SaveAs NewFileName, "com.adobe.acrobat.plain-text"


    AcroXAVDoc.Close False
    AcroXApp.Hide
    AcroXApp.Exit

    End Sub

Where name1 is a downloaded PDF.

So, in essence I am trying to:

  • Download the PDF from a List of URLs/Open the PDF to a specified folder (if necessary)
  • use VBA to convert this to text

I think if I can achieve this, then I will be able to work out how to merge this with the previous post so that it can loop through a given number of URLs to generate the text files as well as handle the non-unicofe characters.

Will update if I make any progress

Setup in Excel: enter image description here

URLs:

https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6

Update: Attempting to add Error handling Highlighting

M Code:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Sub Tester1()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String
    
    
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object

    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("C2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            
            Debug.Print pdfPath, c.Value
            
            DownloadFile url, pdfPath
            
            
            Set AcroXApp = CreateObject("AcroExch.App")
            Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
            AcroXAVDoc.Open pdfPath, "Acrobat" & c.Offset(0, -1).Value & ".txt"
            Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        
            If Error <> 0 Then
            
            c.Interior.Color = vbRed   'flag problem
            
            Else
    
            Set jsObj = AcroXPDDoc.GetJSObject
            jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
            AcroXAVDoc.Close False
            AcroXApp.Hide
            AcroXApp.Exit
            
            c.Interior.Color = vbGreen 'flag OK
            
            End If
             
            
        End If 'have url
        
        
    Next c


    kill pdfPath ' toggle on/off to keep PDF


End Sub

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function
Nick
  • 789
  • 5
  • 22

2 Answers2

2

Tested:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function

Sub Tester()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String, success As Boolean
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            DownloadFile url, pdfPath
            
            If IsErrorPage(pdfPath) Then
                c.Interior.Color = vbRed
                c.Offset(0, 1).Value = "No PDF returned" 'flag in col C
            Else
                success = ConvertPdf2(pdfPath, fileRoot & c.Offset(0, -1).Value & ".txt")
                c.Interior.Color = IIf(success, vbGreen, vbRed)
                c.Offset(0, 1).Value = IIf(success, "OK", "PDF not openable")
            End If
            
        End If 'have url
    Next c
End Sub

'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
    If success Then
        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

'is the supposed PDF actually an HTML error page?
Function IsErrorPage(path) As Boolean
    Dim txt
    txt = CreateObject("scripting.filesystemobject"). _
                  OpenTextFile(path, 1).ReadAll()
    IsErrorPage = InStr(1, txt, "<!DOCTYPE", vbTextCompare) > 0
End Function
Tim Williams
  • 154,628
  • 8
  • 97
  • 125
  • For the private declare function, im getting a compile error. `The code in the project must be updated for use on 64-bit systems. Please review and declare statements and then mark them with the PtSafe attribute`. – Nick Aug 23 '22 at 16:19
  • I have come across this before, its straightforward something to do with: https://learn.microsoft.com/en-us/previous-versions/office/troubleshoot/office-developer/compile-error-editing-vba-macro – Nick Aug 23 '22 at 16:21
  • Sorry I meant to add a note on the 64-bit update. I'm still 32-bit. See https://stackoverflow.com/a/35339528/478884 Updated above. – Tim Williams Aug 23 '22 at 16:22
  • Seem to be working, will test out with multiple PDFs. Also do you know why the else part is red? Please see update. Works as it but seems odd. had this before where it works but the code is red.. – Nick Aug 23 '22 at 16:29
  • I can't spot it but there must be a syntax error in the highlighted part. – Tim Williams Aug 23 '22 at 16:43
  • Hey again, do you know how I can add error handling to this? I'd like to do the same as you added before with the green and red highlighting if say you had a URL like https://hpvchemicals.oecd.org/ui/handler.axd?id=BROKENTEST as part of the list. unlike you previous answer where you have a if nothing statement, using acrobat we don't have this. When a broken URL does occur, Acorbat flashes up with a dialogue box which I click okay to resume, I wonder if its possible to script this to resume without the box? I suspect this is more difficult, but the highlighting would be useful – Nick Aug 25 '22 at 16:04
  • I have attempted various methods fo On Error or If Error but cant quite get it to loop as required. resulting in all the cells being highlighted green even if there is an error. Or alternatively everything turns red as it works through because it reads this as the last step. The logic of Error handling in VBA is confusing – Nick Aug 25 '22 at 16:05
  • Looks like when there is no PDF at the link you get an HTML response which ends up getting saved to disk - you can try reading that file as a text file, and check to see if the content contains (eg) ` – Tim Williams Aug 25 '22 at 16:28
  • Sorry if I wasnt clear, The link provided is made up i.e. to test out the conversion by acrobat. When I run this I get `Acrobat could not open PDF.. because it is either not supported, or because the file has been damaged`. This makes sense because as I say, its made up, im trying to crash the script so I can handle errors. At the moment I just say On error resume next in the covertPDF2 subscript which is fine. it just skips over this and generates the remaining text files however, I don't see how to add the highlighting as per your previous answer. – Nick Aug 25 '22 at 16:35
  • like cause the HTML is broken, on clicking okay on the dialogue box causes an error. Im now trying to be implement the colours so that on error it's highlighted red. I think I can live with manually having to click okay if there is an error in acrobat. Sorry if that doesn't make sense. Happy to post a question but I feel a part 4 may be too many aha.. – Nick Aug 25 '22 at 16:37
  • I have provided an update to the code above, if you run it, it will highlight everything in Red if an error occurs with the Brokentest url – Nick Aug 25 '22 at 16:51
  • 2
    I understand the issue well enough: if the `id` passed to `handler.axd` is not valid then it returns an HTML error page in response, and not a PDF file: you can check for that using the function `IsErrorPage` in my updated answer above. If that passes but the file is *still* not openable, then you'll have to click "OK" in Acrobat to continue but the rest of the process should proceed OK. Added logging of results into column C. you're getting your money's worth on this one.. – Tim Williams Aug 25 '22 at 17:20
  • Im sorry, you have been unbelievably helpful, do let me know if you have any pateron or equivalent I can donate to! – Nick Aug 25 '22 at 20:54
  • No problem - that code might well be useful for my own work so a good exercise for me... – Tim Williams Aug 25 '22 at 20:55
1

I'm not in a position to help with automating acrobat, but here is a procedure I use to download PDFs. Supply a URL to a PDF file and a local path to save the file. Only works on the windows version of Excel.

Sub saveFile(url As String, Optional path As String)

    Dim http As Object
    Dim objfso As Object
    Dim objADOStream As Object
    
    If path = "" Then path = ThisWorkbook.path & "\file.pdf"
    
    Set http = CreateObject("MSXML2.serverXMLHTTP")
    http.Open "GET", url, False
    http.Send
 
    If http.Status = 200 Then
      Set objADOStream = CreateObject("ADODB.Stream")
      objADOStream.Open
      objADOStream.Type = 1 'adTypeBinary
 
      objADOStream.Write http.responsebody
      objADOStream.Position = 0    'Set the stream position to the start
 
      Set objfso = CreateObject("Scripting.FileSystemObject")
        If objfso.fileexists(path) Then objfso.DeleteFile path
      Set objfso = Nothing
 
      objADOStream.SaveToFile path
      objADOStream.Close
      Set objADOStream = Nothing
    End If
 
End Sub
Gove
  • 1,745
  • 10
  • 11