0

I have written a VBA macro to count the (approximate) number of images returned for a Google search of a specific term. By approximate I mean that the program should count the number of images returned, scroll down to load some more (where applicable) up to a max of 400 images counted. Here's the (simplified) code:

Sub GoogleCount ()

'''
'[Code to construct the URL ('fullUrl')]
'''
    Set objIE = New InternetExplorer
    objIE.navigate fullUrl
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    Set currPage = objIE.document
    'Count images returned
    newNum = currPage.getElementById("rg_s").getElementsByTagName("IMG").Length
    'Scroll down until count = 400 (max) or no change in value
    Do While newNum >= 100 And newNum < 400 And newNum <> oldNum
        oldNum = newNum
        currPage.parentWindow.scrollBy 0, 100000
        Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
        newNum = currPage.getElementById("rg_s").getElementsByTagName("IMG").Length
    Loop

'''
'[Code to paste the value of newNum into my workbook, and do some other progress reporting]
'''
End Sub

I'm unhappy about scrolling, it feels very 'manual', especially when you are scrolling by a fixed value (any point making it dynamic? i.e. finding the end of the page and scrolling to there).

But the main problem is that it doesn't work: when I execute the code, it counts the first 100 (or fewer) images fine. But when it's supposed to scroll and count some more, I get the value of 100 returned. Slowly stepping through the code with F8 I get the proper numbers (max 400), which leads me to conclude that the code is running through too quickly (I may be wrong).

To slow the code down I tried adding the objIE.readyState check loop, but because I'm only scrolling I don't think it counts as the page 're-loading' so the loop is ineffective in waiting for the new images to load.

I've thought about adding in a time delay instead. I am already employing

Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

elsewhere in the worksheet - so, I could add as small as a millisecond-order delay.

But I really want to avoid using that, as this code runs for c. 50 different searches and takes long enough to execute already, adding in fixed delays that are long enough to accommodate slow connection speeds would not be ideal. Also, internet speeds vary so much that a fixed delay is very unreliable - I could carry out some kind of connection test to get a better ball-park figure, but the best option is obviously only to wait as long as you have to.

Or better still find a different way of counting the images, preferably one which doesn't involve re-loading the page 4 times! Any ideas?

NB. If you want to debug yourself, a good image search to set fullUrl to might be https://www.google.com/search?q=stack overflow|exchange&tbm=isch&source=lnt&tbs=isz:ex,iszw:312,iszh:390 as it returns >100 images but fewer than 400 so you can test all aspects of the code

Greedo
  • 4,967
  • 2
  • 30
  • 78
  • I recommend you directly scroll to the bottom of the page and returns the last image's index instead of using a loop – Seb Mar 03 '17 at 14:05

2 Answers2

2

Through further research I've come up with this approach:

Dim myDiv As HTMLDivElement: Set myDiv = currPage.getElementById("fbar")
Dim elemRect As IHTMLRect: Set elemRect = myDiv.getBoundingClientRect
Do Until elemRect.bottom > 0
    currPage.parentWindow.scrollBy 0, 10000
    Set elemRect = myDiv.getBoundingClientRect
Loop
myDiv.ScrollIntoView

Where currPage is the HTML webpage (Dim currPage As HTMLDocument) and myDiv is a particular element. The type is not important, but it should be noted that myDiv is always located at the bottom of the document and is only loaded once everything else has been. So for Google images that's the help bar, which you only get to after scrolling through all the image results.

How it works

The code works as follows: myDiv.getBoundingClientRect is a way of checking whether an element is visible in the browser - that's why we need to look at an element at the bottom of the page, as if we scroll until that becomes visible, then everything else must have loaded too.

That's of course where the Do Until...Loop comes from; we loop until the elemRect.bottom value is not zero (as when the element is not in view, it's zero, once it's in view it becomes a non-zero number). More info on that see here

Finally, use a myDiv.ScrollIntoView to get the browser right to the bottom; this is necessary because the BoundingClientRect is visible slightly before the element is on screen, so we need to scroll the last bit in order to load the final images.

Why not just use ScrollIntoView form the start? It doesn't work, since the element hasn't loaded yet.

Community
  • 1
  • 1
Greedo
  • 4,967
  • 2
  • 30
  • 78
  • 1
    NB. Advice to anyone doing VBA for internet related stuff, rather than just searching [tag:VBA] tags, search [tag:JavaScript] as well. Although the language is set out slightly differently, it's possible to transfer most of it into VBA just by looking closely. That's how I did this - no VBA answers out there, but plenty for Javascript (which I **don't** speak!!) but which I could easily apply to this problem – Greedo Mar 29 '17 at 13:45
0

Just do this instead, I am sure you can find a nicer way to do it (if you think it's worth the time) but this should be fine :

newNum = -1
Set objIE = New InternetExplorer
objIE.navigate fullUrl
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Set currPage = objIE.document
Do Until oldNum = newNum
    oldNum = newNum
    newNum = currPage.getElementById("rg_s").getElementsByClassName("rg_di rg_bx rg_el ivg-i").Length        
    Application.Wait Now + TimeSerial(0, 0, 2)
    currPage.parentWindow.scrollBy 0, 100000        
    Application.Wait Now + TimeSerial(0, 0, 2)
    If newNum > 400 Then newNum = 400
Loop

Then you just have to adapt the delay in TimeSerial depending on how fast your computer loads ( in here I set in to 2 seconds)

Seb
  • 508
  • 8
  • 25
  • Perhaps that's an idea, although I don't think the looping is making much difference to calculation time, it's only counting a number. However I think you've missed the main point, I need to load the page when I scroll. Perhaps I should've explained better, but when I send the 'scroll' command, Google scrolls only as far as it has loaded images, then stops (so I need to send the command again). That gives 2 options. I wait for a coded time until the next set of images (100 more) has _fully_ loaded. That way I just scroll 4 times to ensure all 400 images are loaded. Or I scroll until ...happens – Greedo Mar 06 '17 at 11:15
  • With that second option if there's a way to keep scrolling until a) 400 images have been loaded or b) No more images are being loaded, ie. I have reached the end of the results - that would probably be what I'm after. I suppose for a) I just send scroll over and over until count = 400, _not ideal_ as I would have to count over and over too, and that would slow stuff down. But b) is harder, if I could detect when the next images are done loading then I can see if everything is done and if trying to scroll has no effect, meaning I've reached the end of the page. I'm just not sure how to do that. – Greedo Mar 06 '17 at 11:20
  • Ask for more explanation of that, I can't edit to clarify my wording. But in general, your line `currPage.parentWindow.scrollBy 0, 10000000` __won't__ scroll to the last image on the page (the 400th, or less for pages with fewer images), it will only scroll to the 100th then stop. That's why I iterate, wait until the images are loaded _then_ scroll some more to load the next bunch – Greedo Mar 06 '17 at 11:25
  • I understood what yo uare trying to do. Did you try to implement my suggestion ? It does scroll down the whole page and will retrieve all the pictures (at least I tried with your link and it works) – Seb Mar 06 '17 at 17:13
  • I have tried it, and found a couple of problems: As I've mentioned, Google loads images (for me at least!) in batches of 100. Scrolling to the end of the first set of 100 images triggers the next set to load. Indeed, for the link I provided, your code `currPage.parentWindow.scrollBy 0, big number` does trigger all 170 or so images to load, as the code scrolls to the end of the 1st 100 and triggers the loading of the next 100. Well since 170 lies within 200, your code _triggers all images to load_, good so far. – Greedo Mar 08 '17 at 14:11
  • There are 2 main problems though. 1st, although the code _triggers_ the next 100 (or 70 in this case) images to load, it doesn't _wait_ for them to load. That's my main issue, how to make Excel wait. Consequently, hitting __F5__ on your code returns a value for me of 100. If I debug manually with F8, I get the correct value for the count, as you scroll to the end of the 1st 100, load the next set and count it all slowly enough to let the second set load. 2nd problem, is that this _doesn't_ scroll the whole way (for me at least). It scrolls 100 then doesn't scroll again when the next set loads – Greedo Mar 08 '17 at 14:21
  • So for links of more than 200 images, like `https://www.google.com/search?q=stack overflow|exchange&tbm=isch&source=lnt&tbs=isz:ex,iszw:1366,iszh:768`, even if I debug through with F8, I get a count of 200 rather than 400 – Greedo Mar 08 '17 at 14:23
  • Ok I see well then see my edit. Even though I still use this "scroll" method which you preferably wanted to avoid as you said in your post, this works now. – Seb Mar 13 '17 at 16:00