
I want to scrape the following web page:

https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019

As you can see, the page displays a lot of data, yet when I choose "view source", the following HTML is all there is for the data of interest. Where is all the data coming from? How can something be displayed that isn't in the HTML?

<div class="Head_W">
    <div tabindex="0"  tabindex="0"  class="Sub_Title">Auctions Waiting</div>
    <div   class="Fadebar"></div>
        <div tabindex="0"  class="PageFrame" area="W">
            <span class="PageLeft"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle"  /></span>
            <span tabindex="0" class="PageText">page <input id="curPWA" type="text" curPG="" />  of <span id="maxWA"></span> </span>
            <span class="PageRight"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
        </div>
    <div   id="Area_W" class="Auct_Area" ref="Y" arid="W">
        <div tabindex="0"  class="Loading"></div>
    </div>
    <div  class="Fadebar"></div>
        <div tabindex="0"  class="PageFrame" area="W">
            <span class="PageLeft"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle"  /></span>
            <span tabindex="0"class="PageText">page  <input id="curPWB" type="text" curPG=""/>  of <span id="maxWB"></span> </span>
            <span class="PageRight"><img src="/CORE/System/Themes/Theme_1/Images/Common/blank.gif" alt="" width="41" height="16" align="absmiddle" /></span>
        </div>
</div>
Phil
user3217883
  • _"Where is all the data coming from?"_ https://en.wikipedia.org/wiki/Ajax_(programming) – Phil Jul 15 '19 at 04:35
  • Possible duplicate of [Scraping dynamic content in a website](https://stackoverflow.com/questions/8323728/scraping-dynamic-content-in-a-website) – Phil Jul 15 '19 at 04:37

1 Answer

The website https://charlotte.realforeclose.com uses AJAX. You need to do some reverse engineering to find out how it works.

Open Chrome, press F12 to open Developer Tools or choose the option from the menu.

[Image: open chrome dev tools]

Open the Network tab, select the XHR filter, paste the URL https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019 into the browser address bar, and press Enter. Watch the XHRs logged in the Network tab while the page loads. Inspect the XHRs with the largest response sizes first.

[Image: XHRs]

Click a request in the list to see its details: the URL, headers, and parameters of the request.

[Image: XHR request details]

And the response content.

[Image: XHR response]

Since the request method is GET, you can simply paste the URLs into the address bar to retrieve the content. The URLs I captured are:

https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1&tx=1563171184890&bypassPage=1&test=1&_=1563171184890
https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=C&PageDir=0&doR=1&tx=1563171185129&bypassPage=0&test=1&_=1563171185129

After some experimenting, you will find that the parameter AREA=W selects the "Auctions Waiting" section and AREA=C selects the "Auctions Closed or Canceled" section. The parameters tx, bypassPage, test and _ do not seem to be necessary at all.
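
As a rough illustration of that observation, the captured URLs can be reduced to their apparently essential parameters. This is only a sketch: the helper name strip_optional_params and the OPTIONAL_PARAMS set are made up for the example, and the assumption that the server ignores the dropped parameters rests on the experiment described above, not on any documentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that appeared unnecessary when experimenting with the requests
# (an observation, not a documented guarantee).
OPTIONAL_PARAMS = {"tx", "bypassPage", "test", "_"}


def strip_optional_params(url: str) -> str:
    """Return the URL with the apparently unneeded query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in OPTIONAL_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# One of the URLs captured in the Network tab.
captured = (
    "https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION"
    "&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1"
    "&tx=1563171184890&bypassPage=1&test=1&_=1563171184890"
)
print(strip_optional_params(captured))
```

The remaining parameters (zaction, Zmethod, FNC, AREA, PageDir, doR) are kept in their original order.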

Open the first page with PageDir=0 and doR=1; after that, navigate to the next page with PageDir=1 and doR=0, or to the previous page with PageDir=-1 and doR=0.

The first page https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1

[Image: response for first page]

And the next page https://charlotte.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=1&doR=0

[Image: response for next page]
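
The paging scheme could be sketched roughly as follows. The function names update_url and forward_urls are hypothetical, and the number of pages is something you would have to read from the responses yourself. Note that PageDir is relative ("next" or "previous"), which suggests the server remembers the current page for each session; that is an inference, so the requests probably have to be sent in order within one session.

```python
from urllib.parse import urlencode

BASE_URL = "https://charlotte.realforeclose.com/index.cfm"


def update_url(area: str, page_dir: int, do_r: int) -> str:
    """Build an UPDATE/LOAD URL for one section ("W" or "C")."""
    params = {
        "zaction": "AUCTION",
        "Zmethod": "UPDATE",
        "FNC": "LOAD",
        "AREA": area,
        "PageDir": page_dir,  # 0 = first page, 1 = next, -1 = previous
        "doR": do_r,          # 1 on the first request, 0 afterwards
    }
    return BASE_URL + "?" + urlencode(params)


def forward_urls(area: str, page_count: int) -> list:
    """URLs to request, in order, to walk forward through page_count pages."""
    if page_count < 1:
        return []
    first = update_url(area, page_dir=0, do_r=1)
    following = update_url(area, page_dir=1, do_r=0)
    return [first] + [following] * (page_count - 1)


for url in forward_urls("W", 3):
    print(url)
```

Because every "next page" request uses the same URL, the list contains that URL repeated; what changes between requests is the state the server keeps, not the URL.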

Finally, you just need to reproduce those XHRs from your application and parse the responses. Depending on implementation of HTTP requests you may need to add the necessary headers and cookies processing also.
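
A minimal sketch of what "headers and cookies processing" might look like with Python's standard library. Everything specific here is an assumption: the function names are hypothetical, the header values are guesses at what a browser would send, and loading the preview page first to pick up session cookies is one plausible approach, not something confirmed for this site.

```python
import urllib.request
from http.cookiejar import CookieJar

PREVIEW_URL = (
    "https://charlotte.realforeclose.com/index.cfm"
    "?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=07/16/2019"
)
UPDATE_URL = (
    "https://charlotte.realforeclose.com/index.cfm"
    "?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA=W&PageDir=0&doR=1"
)


def make_opener():
    """Create an opener that stores cookies and sends browser-like headers."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Assumed headers; compare with the request details shown in DevTools.
    opener.addheaders = [
        ("User-Agent", "Mozilla/5.0"),
        ("X-Requested-With", "XMLHttpRequest"),
        ("Referer", PREVIEW_URL),
    ]
    return opener, jar


def fetch_first_page(opener) -> str:
    """Load the preview page, then request the first page of area W."""
    # The preview page is requested first so the server can set any
    # session cookies (an assumption about how the site tracks state).
    with opener.open(PREVIEW_URL, timeout=30) as response:
        response.read()
    with opener.open(UPDATE_URL, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Example use (performs real network requests):
#   opener, jar = make_opener()
#   body = fetch_first_page(opener)
```

If the response still differs from what the browser shows, compare the request headers and cookies in the Network tab with what the program sends and add whatever is missing.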

omegastripes
  • But I still can't get the data into the program. Documented here: https://stackoverflow.com/questions/57046770/why-is-inputstreamreader-returning-different-content-than-browser – user3217883 Jul 15 '19 at 20:51
  • @user3217883 Please pay attention to the last statement: "Depending on implementation of HTTP requests you may need to add the necessary headers and cookies processing also". – omegastripes Jul 15 '19 at 21:15