-4

I am trying to scrape https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN Below is the code.

import requests
from bs4 import BeautifulSoup as BS

sess = requests.session()
html = sess.get(url,headers={'User-Agent': 'Mozilla/5.0'},allow_redirects=True)
Soup = BS(html.text,'lxml')
with open('ocswssw.html,'w') as f:
    print(Soup.prettify())

if you compare the ocswssw.html and the website in chrome. they don't match.

but some how the source code I am receiving is not complete. Please let me know what went wrong.

I don't like to use selenium where browser popups.

3 Answers3

0

I am not completely clear on what you are trying to accomplish ultimately, but when it comes to receiving the source I:

1) Added the missing apostrophe for your ocswssw.html argument using the open() method and

2) Ran the code and received pretty much the same source as Google Chrome provides.

Result from BS:

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   OCSWSSW | Member Search
  </title>
  <link href="/Thinclient/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css"/>
  <link href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css"/>
  <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/icheck/square/blue.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/GlobalStyleSheet.css" rel="stylesheet"/>
  <script type="text/javascript">
   HomeURL = "#/forms/new/?table=0x800000000000003D&amp;form=0x800000000000004D&amp;command=0x8000000000000C2D";
        AfterLoginData = null

        LanguageDictionary = {};

        LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}

        LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}

        LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}

        LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}

        LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
  </script>
  <script src="/Thinclient/Scripts/jquery-1.11.1.min.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/icheck.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/kendo.all.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js">
  </script>
  <script>
   kendo.culture("en-US");
  </script>
 </head>
 <body class="k-content">
  <div class="k-loading-mask" id="loadingMsg" style="width:100%;height:100%">
   <span class="k-loading-text">
    Loading...
   </span>
   <div class="k-loading-image">
    <div class="k-loading-color">
    </div>
   </div>
  </div>
  <input id="hdPollingFrequency" type="hidden" value="32767"/>
  <input id="hdPrivateComputerTimeout" type="hidden" value="32767"/>
  <input id="hdPublicComputerTimeout" type="hidden" value="32767"/>
  <input id="hdWarningDisplayDuration" type="hidden" value="0"/>
  <input id="hdWindowsAuthentication" type="hidden" value="false"/>
  <div class="container">
   <div id="content">
   </div>
  </div>
  <div id="loading" style="display: none;">
   <h1>
    We are processing your request. Please be patient.
   </h1>
   <input class="abortButton" type="button" value="Abort"/>
  </div>
  <script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
   <div class="panelBlock">
            <div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
            <a class="imgPanelDD" href="#">&nbsp;</a></div>
            </div>
                    <div class="panelContent1" id="panelContent1 + ${DisplayName}">
                        <ul>
                            {{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
                        </ul>                   
                    </div>            
            </div>
  </script>
  <script id="KendoTestTemplate" type="text/x-kendo-template">
   <h2>#= test #</h2>
        <ul>
                        #= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
        </ul>
  </script>
  <script id="KendoTestLiTemplate" type="text/x-kendo-template">
   <li>#= displayName#</li>
  </script>
  <script id="ErrorTemplate" type="text/x-jquery-tmpl">
   <div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
            <div class="k-notification-wrap">
                <span class="k-icon k-i-note">
                    error
                </span>
                ${errorMsg}
                <span class="k-icon k-i-close">
                    Hide
                </span>
            </div>
        </div>
  </script>
  <script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
   <button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
  </script>
  <script id="IconTemplate" type="text/x-jquery-tmpl">
   <span class="k-icon ${icon}"></span>
  </script>
  <script id="trash" type="text/x-kendo-template">
   <li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
    <li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
  </script>
  <script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
   {{if ImageId}}
            <li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
      {{else}}
        <li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
       {{/if}}
  </script>
  <script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
   <button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
                            <span>${DisplayName}</span>
        </button>
  </script>
  <script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript">
  </script>
 </body>
</html>

Result from Browser source

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>OCSWSSW | Member Search</title>
    <link href="/Thinclient/favicon.ico" type="image/x-icon" rel="shortcut icon" />
    <link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css" />
    <link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" />
    <link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" />
    <link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css" />
    <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">    
    <link rel="stylesheet" href="/Thinclient/Content/icheck/square/blue.css" />

            <link rel="stylesheet" href="/Thinclient/Content/GlobalStyleSheet.css" />


    <script type="text/javascript" >
        HomeURL = "#/forms/new/?table=0x800000000000003D&amp;form=0x800000000000004D&amp;command=0x8000000000000C2D";
        AfterLoginData = null

        LanguageDictionary = {};

        LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}

        LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}

        LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}

        LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}

        LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
    </script>
    <script type="text/javascript" src="/Thinclient/Scripts/jquery-1.11.1.min.js"></script>
    <script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript"></script>

    <script src="/Thinclient/Scripts/icheck.min.js"></script>
    <script src="/Thinclient/Scripts/kendo/kendo.all.min.js"></script>
    <script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js"></script>
    <script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js"></script>

    <script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js"></script>
    <script>
        kendo.culture("en-US");
</script>




</head>
<body class="k-content">
    <div id="loadingMsg" class="k-loading-mask" style="width:100%;height:100%">
        <span class="k-loading-text">Loading...</span>
        <div class="k-loading-image">
            <div class="k-loading-color"></div>
        </div>
    </div>

    <input type="hidden" ID="hdPollingFrequency" value="32767"/>
    <input type="hidden" ID="hdPrivateComputerTimeout" value= "32767"/>
    <input type="hidden" ID="hdPublicComputerTimeout" value="32767"/>
    <input type="hidden" ID="hdWarningDisplayDuration" value="0"/>
    <input type="hidden" id="hdWindowsAuthentication" value="false"/>



<div class="container">
    <div id="content">

    </div>
</div>

    <div id="loading" style="display: none;">
        <h1>
            We are processing your request. Please be patient.</h1>
        <input type="button" value="Abort" class="abortButton" />
    </div>
    <script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
        <div class="panelBlock">
            <div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
            <a class="imgPanelDD" href="#">&nbsp;</a></div>
            </div>
                    <div class="panelContent1" id="panelContent1 + ${DisplayName}">
                        <ul>
                            {{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
                        </ul>                   
                    </div>            
            </div>      
    </script>
    <script id="KendoTestTemplate" type="text/x-kendo-template">
        <h2>#= test #</h2>
        <ul>
                        #= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
        </ul>   
    </script>
    <script id="KendoTestLiTemplate" type="text/x-kendo-template">
        <li>#= displayName#</li>    
    </script>
    <script id="ErrorTemplate" type="text/x-jquery-tmpl">
        <div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
            <div class="k-notification-wrap">
                <span class="k-icon k-i-note">
                    error
                </span>
                ${errorMsg}
                <span class="k-icon k-i-close">
                    Hide
                </span>
            </div>
        </div>
    </script>
    <script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
        <button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
    </script>
    <script id="IconTemplate" type="text/x-jquery-tmpl">
        <span class="k-icon ${icon}"></span>
    </script>
    <script id="trash" type="text/x-kendo-template">
    <li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
    <li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>

    </script>
    <script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
     {{if ImageId}}
            <li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
      {{else}}
        <li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
       {{/if}}
    </script>
    <script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
        <button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
                            <span>${DisplayName}</span>
        </button>
    </script>
    <script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript"></script>      
    <script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript"></script>  
    <script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript"></script>
    <script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript"></script>
    <script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript"></script>

    <script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript"></script>
    <script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript"></script>










    <script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript"></script>




</body>
</html>


<script>
    //$( window ).load( 
    //$(".k-state-default").hover(function () {
    //    $(this).toggleClass("k-state-hover");
    //})
    //);
</script>

Is this not what you are looking for from Beautiful Soup?

Jonathan
  • 487
  • 1
  • 9
  • 19
0

The page is created by using javascript. So,you can't get page source by only using requests/bs4

how to resoleve:use HeadlessChrome which create page source created by javascript

0

it dynamic page (Ajax) you can't use bs4, if you dont like selenium where browser popups you can add --headless option to hide it. here example

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless')
#options.add_argument('--disable-gpu')  # maybe needed if running on Windows.
driver = webdriver.Chrome(chrome_options=options)

print("Loading Page...")
driver.get('https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN/')

# wait max 20 second until ajax content rendered
print("Wait Ajax finished...")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID , 'MainForm')))

html = driver.execute_script("return document.documentElement.outerHTML")
Soup = BeautifulSoup(html, 'html.parser')
with open('ocswssw.html', 'w') as f:
    sourceCode = Soup.prettify().encode('utf-8')
    f.write(sourceCode)
    print(sourceCode)

driver.quit()
ewwink
  • 18,382
  • 2
  • 44
  • 54