3

I've written a script in vba using IE to parse some links from a webpage. The thing is the links are within an iframe. I've twitched my code in such a way so that the script will first find a link within that iframe and navigate to that new page and parse the required content from there. If i do this way then I can get all the links.

Webpage URL: weblink

Successful approach (working one):

Sub Get_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim elem As Object, post As Object

    With IE
        .Visible = True
        .navigate "put here the above link"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set elem = .document.getElementById("compInfo")   #it is within iframe
        .navigate elem.src
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With

    For Each post In HTML.getElementsByClassName("news")
        With post.getElementsByTagName("a")
         If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
        End With
    Next post
    IE.Quit
End Sub 

I've seen few sites where no such links exist within iframe so, I will have no option to use any link to track down the content.

If you take a look at the below approach by tracking the link then you can notice that I've parsed the content from a webpage which are within Iframe. There is no such link within Iframe to navigate to a new webpage to locate the content. So, I used contentWindow.document instead and found it working flawlessly.

Link to the working code of parsing Iframe content from another site: contentWindow approach

However, my question is: why should i navigate to a new webpage to collect the links as I can see the content in the landing page? I tried using contentWindow.document but it is giving me access denied error. How can I make my below code work using contentWindow.document like I did above?

I tried like this but it throws access denied error:

Sub Get_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim frm As Object, post As Object

    With IE
        .Visible = True
        .Navigate "put here the above link"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With

    ''the code breaks when it hits the following line "access denied error"

    Set frm = HTML.getElementById("compInfo").contentWindow.document

    For Each post In frm.getElementsByClassName("news")
        With post.getElementsByTagName("a")
         If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
        End With
    Next post
    IE.Quit
End Sub

I've attached an image to let you know which links (they are marked with pencil) I'm after.

These are the elements within which one such link (i would like to grab) is found:

<div class="news">
    <span class="news-date_time"><img src="images/arrow.png" alt="">19 Jan 2018 00:01</span>
    <a style="color:#5b5b5b;" href="/HomeFinancial.aspx?&amp;cocode=INE117A01022&amp;Cname=ABB-India-Ltd&amp;srno=17019039003&amp;opt=9">ABB India Limited - Press Release</a>
 </div>

Image of the links of that page I would like to grab:

enter image description here

From the very first day while creating this thread I strictly requested not to use this url http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022 to locate the data. I requested any solution from this main_page_link without touching the link within iframe. However, everyone is trying to provide solutions that I've already shown in my post. What did I put a bounty for then?

SIM
  • 21,997
  • 5
  • 37
  • 109
  • 3
    Take a look at [Same-origin policy](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy). – omegastripes Feb 10 '18 at 19:25
  • Why not just make XHR to URL `http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022` and extract that links from response? – omegastripes Feb 10 '18 at 19:43
  • I started this post after I've done what you have suggested which I've already described in my post. I expect any solution without navigating to a new webpage using the link available in the iframe. As the desired links are available in the first page, I suppose there should be any way to grab those without navigating to a new page. – SIM Feb 10 '18 at 20:34
  • I do not suggest you to navigate, I suggest to make XHR instead of using IE. – omegastripes Feb 10 '18 at 20:44
  • I got you wrong @omegastripes. If I make XHR with the link you have provided above which in reality is within the Iframe, I can get those desired links. The point is why should I go for XHR as you know, If i navigate to the new link I can also get the results. Both of the cases I need to reuse that link. My question is not about the link but about the manipulation of the content within Iframe to reach the results. – SIM Feb 11 '18 at 11:39
  • Topto, have you followed by the link in the first comment? That is the answer why you can see the links but can't access them programmatically. – omegastripes Feb 11 '18 at 11:54
  • Please see the edit. – SIM Feb 20 '18 at 10:14
  • Then the only way IMO is to follow the idea described in the second part of my answer, using `IE.ExecWB`. – omegastripes Feb 20 '18 at 10:15
  • @novice-coder, see the approach I have posted. It is complex in nature but it is I believe the only option I know could work, if you don't want to directly browser to the IFRAME URL – Tarun Lalwani Feb 20 '18 at 14:13
  • @omegastripes Is there something in the link that tells you it is same origin policy applied or is it something implied by the access denied message? – QHarr Jul 15 '18 at 15:03
  • Is it because of the src property of the iframe element being different? And that content from that link is perhaps dynamically loaded? – QHarr Jul 15 '18 at 15:16
  • 1
    @QHarr There are examples by the link I pointed above, showing if two pages have the same origin. Generally the ` – omegastripes Jul 15 '18 at 21:27

3 Answers3

2

Something like this should work. They key is to realize the iFrame is technically another Document. Reviewing the iFrame on the page you listed, you can easily use a web request to get at the data you need. As already mentioned, the reason you get an error is due to the Same-Origin policy. You could write something to get the src of the iFrame then do the web request as I've shown below, or, use IE to scrape the page, get the src, then load that page which looks like what you have done.

I would recommend using a web request approach, Internet Explorer can get annoying, fast.

Code

Public Sub SOExample()
    Dim html     As Object 'To store the HTML content
    Dim Elements As Object 'To store the anchor collection
    Dim Element  As Object 'To iterate the anchor collection
    Set html = CreateObject("htmlFile")

    With CreateObject("MSXML2.XMLHTTP")
        'Navigate to the source of the iFrame, it's another page
        'View the source for the iframe. Alternatively -
        'you could navigate to this page and use IE to scrape it
        .Open "GET", "https://stocks.thehindubusinessline.com/Companydetails.aspx?&cocode=INE117A01022"
        .send ""

        'See if the request was ok, exit it there was an error
        If Not .Status = 200 Then Exit Sub

        'Assign the page's HTML to an HTML object
        html.body.InnerHTML = .responseText
        Set Elements = html.body.document.getElementByID("hmstockchart_CompanyNews1_updateGLVV")
        Set Elements = Elements.getElementsByTagName("a")

        For Each Element In Elements
            'Print out the data to the Immediate window
            Debug.Print Element.InnerText
        Next

    End With
End Sub

Results


ABB India Limited - AGM/Book Closure
Board of ABB India recommends final dividend
ABB India to convene AGM
ABB India to pay dividend
ABB India Limited - Outcome of Board Meeting
More ?
Ryan Wildry
  • 5,612
  • 1
  • 15
  • 35
2

You can see the links within <iframe> in browser but can't access them programmatically due to Same-origin policy.

There is the example showing how to retrieve the links using XHR and RegEx:

Option Explicit

Sub Test()

    Dim sContent As String
    Dim sUrl As String
    Dim aLinks() As String
    Dim i As Long

    ' Retrieve initial webpage HTML content via XHR
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/", False
        .Send
        sContent = .ResponseText
    End With
    'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
    ' Extract target iframe URL via RegEx
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        ' Process all a within div.news
        .Pattern = "<iframe[\s\S]*?src=""([^""]*?Companydetails[^""]*)""[^>]*>"
        sUrl = .Execute(sContent).Item(i).SubMatches(0)
    End With
    ' Retrieve iframe HTML content via XHR
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", sUrl, False
        .Send
        sContent = .ResponseText
    End With
    'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
    ' Parse links via XHR
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        ' Process all anchors within div.news
        .Pattern = "<div class=""news"">[\s\S]*?href=""([^""]*)"
        With .Execute(sContent)
            ReDim aLinks(0 To .Count - 1)
            For i = 0 To .Count - 1
                aLinks(i) = .Item(i).SubMatches(0)
            Next
        End With
    End With
    Debug.Print Join(aLinks, vbCrLf)

End Sub

Generally RegEx's aren't recommended for HTML parsing, so there is disclaimer. Data being processed in this case is quite simple that is why it is parsed with RegEx.

The output for me as follows:

/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17047038016&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17046039003&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17045039006&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043039002&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043010019&opt=9

I also tried to copy the content of the <iframe> from IE to clipboard (for further pasting to the worksheet) using commands:

IE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT
IE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT

But actually that commands select and copy the main document, excluding the frame, unless I click on the frame manually. So that might be applied if click on the frame could be reproduced from VBA (frame node methods like .focus and .click didn't help).

omegastripes
  • 12,351
  • 4
  • 45
  • 96
1

The simple of solution like everyone suggested is to directly go the link. This would take the IFRAME out of picture and it would be easier for you loop through links. But in case you still don't like the approach then you need to get a bit deeper into the hole.

Below is a function from a library I wrote long back in VB.NET

https://github.com/tarunlalwani/ScreenCaptureAPI/blob/2646c627b4bb70e36fe2c6603acde4cee3354b39/Source%20Code/ScreenCaptureAPI/ScreenCaptureAPI/ScreenCapture.vb#L803

Private Function _EnumIEFramesDocument(ByVal wb As HTMLDocumentClass) As Collection
    Dim pContainer As olelib.IOleContainer = Nothing
    Dim pEnumerator As olelib.IEnumUnknown = Nothing
    Dim pUnk As olelib.IUnknown = Nothing
    Dim pBrowser As SHDocVW.IWebBrowser2 = Nothing
    Dim pFramesDoc As Collection = New Collection

    _EnumIEFramesDocument = Nothing

    pContainer = wb

    Dim i As Integer = 0

    ' Get an enumerator for the frames
    If pContainer.EnumObjects(olelib.OLECONTF.OLECONTF_EMBEDDINGS, pEnumerator) = 0 Then

        pContainer = Nothing

        ' Enumerate and refresh all the frames
        Do While pEnumerator.Next(1, pUnk) = 0

            On Error Resume Next

            ' Clear errors
            Err.Clear()

            ' Get the IWebBrowser2 interface
            pBrowser = pUnk

            If Err.Number = 0 Then
                pFramesDoc.Add(pBrowser.Document)
                i = i + 1
            End If

        Loop

        pEnumerator = Nothing

    End If

    _EnumIEFramesDocument = pFramesDoc
End Function

So basically this is a VB.NET version of below C++ version

Accessing body (at least some data) in a iframe with IE plugin Browser Helper Object (BHO)

Now you just need to port it to VBA. The only problem you may have is finding the olelib rerefernce. Rest most of it is VBA compatible

So once you get the array of object, you will find the one which belongs to your frame and then you can just that one

frames = _EnumIEFramesDocument(IE)
frames.Item(1).document.getElementsByTagName("A").length
Tarun Lalwani
  • 142,312
  • 9
  • 204
  • 265