1

I'm trying to scrape data from "https://beacon.schneidercorp.com/" and need to achieve:

  1. Set "Iowa" on the state combobox and "Adair County, IA" in the County/city/area combobox
  2. Bring the Property Search button
  3. Click the Property Search button and get to the next page

After all this, the browser gets to "https://beacon.schneidercorp.com/Application.aspx?AppID=1034&LayerID=22042&PageTypeID=2&PageID=9328" which is my main goal.

I filled the comboboxes (tagname="option") but the next problems came up:

a. The Property Search I want to click to get to the next page, doesn't pop up until I physically click and select one option on the County/city/area combobox

This is the routine that fills the comboboxes

Sub extraccionCondados2()
   Dim IE As New SHDocVw.InternetExplorer
   Dim htmlDoc As MSHTML.HTMLDocument
   Dim htmlElementos As MSHTML.IHTMLElementCollection
   Dim htmlElemento As MSHTML.IHTMLElement
   
   IE.Visible = True
   IE.navigate "https://beacon.schneidercorp.com/"
    
   Do While IE.readyState <> READYSTATE_COMPLETE
      DoEvents
   Loop
   
   Set htmlDoc = IE.document
   Set htmlElementos = htmlDoc.getElementsByClassName("form-control input-lg")
   htmlElementos(0).Value = "Iowa" 'POPULATES THE STATE COMBOBOX
   htmlElementos(1).Value = "1034" 'POPULATES THE COUNTY/CITY/AREA WITH THE RIGHT VALUE
   htmlElementos(1).Click 'IN THIS CASE THIS LINE DOESN'T DO ANYTHING
   'I'VE TRIED WORKING WITH htmlElementos CHILDREN BUT DIDN'T FIND A WAY TO DO IT
End Sub

b. The href I'm looking for doesn't come up until the Property Search is brought to the view

The id="quickstartList" is empty before the Property Search is shown
enter image description here

The id="quickstartList" got new children after the Property Search is shown and has my target URL
enter image description here

How do I bring the Property Search button, or better, fetch the href on the second image?

Community
  • 1
  • 1
  • If your target page is "https://beacon.schneidercorp.com/Application.aspx?AppID=1034&LayerID=22042&PageTypeID=2&PageID=9328" then why not just navigate directly to it? Replace `IE.Navigate "https://beacon.schneidercorp.com/"` with `IE.Navigate "https://beacon.schneidercorp.com/Application.aspx?AppID=1034&LayerID=22042&PageTypeID=2&PageID=9328"` – jamheadart Jun 08 '20 at 04:05
  • @jamheadart I think *Iowa* is only an example. – Zwenn Jun 08 '20 at 09:20
  • I figure COUNTY/CITY/AREA field is what defines the page direction and he needs to know the code for that anyway (1034) - if he needs to know those codes in the first place then just sub them in the `https` string for every option he requires. – jamheadart Jun 08 '20 at 09:41
  • Using IE is an awful way of automating datascraping anyway, if you have the time I really recommend using background http objects to do what you require https://stackoverflow.com/questions/158633/how-can-i-send-an-http-post-request-to-a-server-from-excel-using-vba – jamheadart Jun 08 '20 at 09:42
  • @jamheadart thanks for your comments... I first set the direct URL, but the website is constantly changing and the last week URL changed, so it made the scraper crash... I know that IE is no the best way but I rather using it while trobleshooting becuase it let me see the performance... I'll give a check to your suggestion – Antonio Graterol Jun 08 '20 at 15:20
  • While doing all this, keep an eye on the Network tab of the developer tool (F12) - you can see the raw requests and responses and it's these you should emulate rather than button clicks etc. I promise you if you invest the time to use background objects instead of IE automation it's so much better! – jamheadart Jun 09 '20 at 02:56
  • @jamheadart I'm trying to have the same results of the zwenn code but it seems to be necessary to use IE object. Is there a way to have such results using XMLHTTP, ServerXMLHTTP or WinHttp? – Antonio Graterol Jun 09 '20 at 21:20
  • @jamheadart I was checking that Network Tab but didn't understanding it, I'm going to research about it... Can you give a deeper hint of what and when I need to be aware of? – Antonio Graterol Jun 09 '20 at 21:21
  • @jamheadart when you say "background objects" what are you meaning? I've google it but nothing didn't hit my eyes – Antonio Graterol Jun 09 '20 at 21:22
  • Yeah by background objects I mean things that can send web requests without needing a browser. An internet browser is mainly for graphical interfacing (and can take much longer to load results). In Excel VBA you can use `ServerXMLHTTP` - I built a whole system at work using POST and GET requests sent via ServerXMLHTTP. You just need the URL for GET requests and POST requests will be sent with a "body". Once you know the structure on your target websites you just replace parts of the URL and Body with variables and can parse out information with the XML/HTML it returns. – jamheadart Jun 10 '20 at 04:56

2 Answers2

3

You must trigger the change event after each selection from a combobox:

Sub extraccionCondados2()
  Dim IE As New SHDocVw.InternetExplorer
  Dim htmlDoc As MSHTML.htmlDocument
  Dim htmlElementos As MSHTML.IHTMLElementCollection
  Dim htmlElemento As MSHTML.IHTMLElement
  Dim urlFromPropertySearchButton As String

  IE.Visible = True
  IE.navigate "https://beacon.schneidercorp.com/"
  Do While IE.readyState <> 4: DoEvents: Loop

  Set htmlDoc = IE.document
  Set htmlElementos = htmlDoc.getElementsByClassName("form-control input-lg")

  'Select state and trigger html change event of the combobox
  htmlElementos(0).Value = "Iowa"
  Call TriggerEvent(htmlDoc, htmlElementos(0), "change")

  'Select country/city/area and trigger html change event of the combobox
  htmlElementos(1).Value = "1034"
  Call TriggerEvent(htmlDoc, htmlElementos(1), "change")

  'Get property search button
  Set htmlElemento = htmlDoc.getElementsByClassName("list-group-item track-mru")(0)

  'If needed as string read url
  urlFromPropertySearchButton = htmlElemento.href
  'You have the url before clicking the button
  MsgBox urlFromPropertySearchButton

  'If you want to open the page for selection
  htmlElemento.Click
End Sub

This procedure to trigger a html event:

Private Sub TriggerEvent(htmlDocument As Object, htmlElementWithEvent As Object, eventType As String)

  Dim theEvent As Object

  htmlElementWithEvent.Focus
  Set theEvent = htmlDocument.createEvent("HTMLEvents")
  theEvent.initEvent eventType, True, False
  htmlElementWithEvent.dispatchEvent theEvent
End Sub
Zwenn
  • 2,147
  • 2
  • 8
  • 14
  • I'm very thankful for your help, your code works like a charm. I also casted my vote to your solution but as I'm a newbie it doesn't count very much... Can you point out where I can find more good information to keep learning web-scraping? – Antonio Graterol Jun 08 '20 at 15:50
  • 1
    @QHarr Didn't know how. I just did it! Thanks – Antonio Graterol Jun 08 '20 at 21:41
  • 1
    @QHarr I had completely misunderstood something yesterday and therefore deleted my comment from yesterday. You're right, `Call` is not necessary. I just like to use it because I find the code more pleasant to read. – Zwenn Jun 09 '20 at 14:32
  • @AntonioGraterol I do not know of a central place where everything is explained about web scraping. But it is a good idea to use the links from QHarr's profile as a starting point for learning. Further questions for own researches will automatically arise. Above all it is important to gain practical experience. To program, to try things out, to study and change the code of others etc. – Zwenn Jun 09 '20 at 14:43
  • Many thanks... I'm researching to solve a new hurdle in this project, if I don't find the solution I'll ask again – Antonio Graterol Jun 09 '20 at 19:45
1

Some advice on using MSXML2.ServerHTTP objects to automate web-scraping using your target website as an example.

Firstly, you can get to the page you wanted in the question like this:

Sub Example1()

Dim con As New MSXML2.ServerXMLHTTP60 ' A web request object - must add project reference to "Microsoft XML, V6.0" in Tools > References

    ' Opens a new GET request (no hidden info) for the url
    con.Open "GET", "https://beacon.schneidercorp.com/Application.aspx?AppID=1034&PageTypeID=2"
    con.setRequestHeader "Content-type", "application/x-www-form-urlencoded" ' set a standard content-type for the request
    con.send searchBody ' Send the request

    MsgBox con.responseText

End Sub

Note in the URL I've only had to include AppID=1034 for Adair county and PageTypeID=2 for property search (I think pagetypeId 1 was map). You can get the full list of AppID from the main page just by looking at the HTML (I guess you've figured out how to do this already). The MsgBox just shows that the con object has returned the response as an html document.

While working on your project and to help debug and look at html, if you want to view any response from a request at leisure, I use the below function to save a string as a text file:

Sub WriteToFile(s As String, n As String)
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Dim oFile As Object
Set oFile = fso.CreateTextFile(n)
oFile.WriteLine s
oFile.Close
Set fso = Nothing
Set oFile = Nothing
End Sub

So for the above code I'd call that function at the end to save my response as text files which I can view as HTML using notepad++. You can just view the html in the F12 dev tool too without saving it.

I've also included below an HTMLdocument object, which I put the response into.

Sub Example2()

Dim con As New MSXML2.ServerXMLHTTP60 ' A web request object - must add project reference to "Microsoft XML, V6.0" in Tools > References
Dim html As New HTMLDocument ' An html document to hold responses, used to parse info - add reference to "Microsoft HTML Object Library"

    ' Opens a new GET request (no hidden info) for the url
    con.Open "GET", "https://beacon.schneidercorp.com/Application.aspx?AppID=1034&PageTypeID=2"
    con.setRequestHeader "Content-type", "application/x-www-form-urlencoded" ' set a standard content-type for the request
    con.send searchBody ' Send the request

    WriteToFile con.responseText, "C:\Users\JamHeadArt\Documents\responseText.txt"
    html.body.innerhtml = con.responseBody

End Sub

With the html document populated, you can then use things like getElementByID to help parse results etc. It's just another form of XML so you can traverse nodes and find things by child/parent relationships etc.


Using the F12 dev tool

I can figure out this stuff using the F12 developer tool, under network. Before clicking a search button or whatever, just clear the network traffic and then when you click a search, you'll see a bunch of requests. The first one is usually the one you want to check out and basically mimic (the rest of the requests will be javascript firing, css, images, general stuff). Any request has a URL and sometimes a BODY if it's a post request.

Without going in to TOO much detail, you can usually skip a whole bunch of search steps and pages, and get the info you need by knowing the structure and parameters of that final search, making literally one call to the website, with the return info parsed directly into Excel. No browsers used, much much faster.


After selecting Iowa, did you find the html for the drop down list in the html that has all the option values?

<optgroup label="Iowa">
    <option value="1034">Adair County,  IA</option>
    <option value="78">Allamakee County, IA</option>
    <option value="165">Ames, IA</option>
    <option value="96">Audubon County, IA</option>
    <option value="83">Benton County, IA</option>
    <option value="84">Boone County, IA</option>
    <option value="330">Bremer County, IA</option>
    <option value="1015">Buena Vista County,  IA</option>
    <option value="215">Cass County, IA</option>
    <option value="408">Cerro Gordo County, IA</option>
    <option value="501">Cherokee County, IA</option>
    <option value="47">Chickasaw County, IA</option>
    <option value="29">City of Ames, IA - Traffic Accident Database</option>
    <option value="933">City of Cascade, IA</option>
    <option value="516">City of Estherville, IA</option>
    <option value="1061">City of Sigourney, IA</option>
    <option value="1043">Clay County,  IA</option>
    <option value="227">Clayton County, IA</option>
    <option value="375">Clinton County, IA</option>
    <option value="909">Dallas County,  IA</option>
    <option value="49">Davis County, IA</option>
    <option value="72">Delaware County, IA</option>
    <option value="376">Dickinson County, IA</option>
    <option value="93">Dubuque County, IA</option>
    <option value="15">Emmet County, IA</option>
    <option value="79">Fayette County, IA</option>
    <option value="82">Floyd County, IA</option>
    <option value="150">Franklin County, IA</option>
    <option value="825">Fremont County,  IA</option>
    <option value="1064">Greene County,  IA</option>
    <option value="3">Grundy County, IA</option>
    <option value="395">Guthrie County, IA</option>
    <option value="140">Hardin County, IA</option>
    <option value="44">Harrison County, IA</option>
    <option value="60">Henry County, IA</option>
    <option value="617">Humboldt County, IA</option>
    <option value="80">Jackson County, IA</option>
    <option value="325">Jasper County, IA</option>
    <option value="1037">Jefferson County,  IA</option>
    <option value="86">Johnson County, IA</option>
    <option value="164">Jones County, IA</option>
    <option value="81">Keokuk County, IA</option>
    <option value="177">Lee County, IA</option>
    <option value="54">Louisa County, IA</option>
    <option value="594">Lyon County, IA</option>
    <option value="406">Madison County, IA</option>
    <option value="25">Mahaska County, IA</option>
    <option value="70">Marion County, IA</option>
    <option value="1026">Marshall County,  IA</option>
    <option value="410">Mason City, IA</option>
    <option value="153">Mills County, IA</option>
    <option value="929">Mitchell County,  IA</option>
    <option value="21">Montgomery County, IA</option>
    <option value="12">Muscatine Area Geographic Information Consortium (MAGIC)</option>
    <option value="331">O'Brien County, IA</option>
    <option value="611">Osceola County, IA</option>
    <option value="220">Page County, IA</option>
    <option value="218">Palo Alto County, IA</option>
    <option value="1012">Plymouth County,  IA</option>
    <option value="144">Pocahontas County, IA</option>
    <option value="135">Poweshiek County, IA</option>
    <option value="508">Ringgold County, IA</option>
    <option value="75">Sac County, IA</option>
    <option value="1024">Scott County / City of Davenport, Iowa</option>
    <option value="11">Shelby County, IA</option>
    <option value="10">Sioux City, IA</option>
    <option value="984">Sioux County,  IA</option>
    <option value="165">Story County, IA / City of Ames</option>
    <option value="225">Union County, IA</option>
    <option value="595">Wapello County, IA</option>
    <option value="9">Warren County, IA</option>
    <option value="1036">Washington County,  IA</option>
    <option value="723">Webster County, IA</option>
    <option value="73">Winnebago County, IA</option>
    <option value="110">Winneshiek County, IA</option>
    <option value="10">Woodbury County, IA / Sioux City</option>
    <option value="588">Worth County, IA</option>
    <option value="399">Wright County, IA</option>
</optgroup>
jamheadart
  • 5,047
  • 4
  • 32
  • 63
  • That's amazing. I'm just waking up but I'm looking forward to test it... Thanks for spending your time in doing such good explanation. I'll let you know how it goes – Antonio Graterol Jun 10 '20 at 09:26
  • It's just to get started, there will be so many different potential things on each website, that it'd be impossible to write a full guide. That's probably why no comprehensive guide exists, but whenever you get stuck on a particular issue, StackOverflow is the place to go! I learned everything I know through googling problems and asking questions on here. – jamheadart Jun 10 '20 at 09:29
  • You just gave a so simple solution that I really couldn't get it... All the available time that I had these past two days was applied to learn how to make a POST Request sending `value="1034"` or `innertext="Adair County, IA"` as requestBody, that makes the website to put out the "Property Search" option, and make needed URL scrapable. Back to your solution I noticed that you already solve it without too much effort, and I'm thankful for that, but the truth is that I feel really challenged by being unable to scrape the URLs for all the Iowa counties, which is my real goal... – Antonio Graterol Jun 11 '20 at 23:25
  • The result of my research led me to know that (using the Developer tool's Network tab) after selecting the "Adair County, IA", only one process is fired, it's called "1034" which is the same `value` of that option, its type is xhr (XmlHttpRequest) and it's initiated by a "home.min.js:3" (which I checked and I think is a 10000-line javascript routine that I don't understand). I could also see that Request URL is https://beacon.schneidercorp.com/api/globalsearch/announcements/1034 and Request Method is GET... **After all I'd like to know if that can be mimicked** – Antonio Graterol Jun 11 '20 at 23:27
  • So when you say for all of the Iowa counties, did you find the list in the html? I've added that to my answer. The JS that is fired is just to narrow down the dropdownlist. What I really need to know is, once you've selected Iowa and a county (like Adair) and you move on to the Property Search - **what do you do next?** and what info are you wanting to pull from that next page? – jamheadart Jun 12 '20 at 06:15
  • I'm going to show the processes that I need to cover 1. Open beacon.schneidercorp.com, and the for for all Iowa counties 2. Select a county 3. Go to property search4. In the next page, click Comp Search 5. In the next page, go to Agricultural Comparable Search, set all 5-month-old sales and click the corresponding search button 6. In the next page, harvest specific information for all the sales with prices higher than zero. – Antonio Graterol Jun 12 '20 at 10:49
  • Before zwenn's solution I successfully to scraped values and inner text for all Iowa counties but I couldn't find how to get the URL brought after the Property Search button is shown. With your solution I was able to continue without using any internet explorer object... But as I said before I feel challenged given that I couldn't find a way to retrieve those URLs – Antonio Graterol Jun 12 '20 at 10:50
  • Ok so most likely everything you do up to step 5 is building up one big request and it's most likely that you can send a single request with all the correct parameters and get to step 6 in one go. It'd probably be a big request with a POST body that will need some variables, I will have another look soon but it could be tomorrow by the time I post another reply. – jamheadart Jun 12 '20 at 10:56
  • Yes, one big query is my way to go... Thanks, it's nice to know that you are around – Antonio Graterol Jun 12 '20 at 11:02
  • It seems that every county can have different pages. Some of them need logins, some of them have more than one comp search... I tried a couple of generic URL calls using `Q=1205421259` as the final parameter to one or two of the counties that had the same format, but I couldn't even find any results manually to compare against. If you can find two different result sets fitting your 5 month date agriculture comp search criteria, from two counties that I can see I may be able to help, otherwise it'd take a lot of time I couldn't invest! – jamheadart Jun 13 '20 at 08:33
  • Finally I could get solution but the rose a new problem which will be posted... I made a `POST` request using the last URL before having the results that I need to scrape, as searchBody I used all the values inside the Form Data subsection of Header section of the Developer Toolkit's Network Tab and it worked, but when I started the scraping process the recieved html code was different than the one displayed in the browser. You'll see... **Is there a chance that the starting URL suddenly might change?** If it does my solution is doomed! – Antonio Graterol Jun 14 '20 at 02:32