I apologise if this has been asked before. I am very new to everything.

I am trying to parse the page from the following website. However, the resultant script scraped is not the same script as the one I observe by inspecting the page source on Chrome.

import pandas as pd  # not used in this snippet
from bs4 import BeautifulSoup as bsoup
import requests as rq

url_estates = "https://www.propertyguru.com.sg/singapore-property-listing/hdb"
# Send a browser-like User-Agent header with the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
client = rq.get(url_estates, headers=headers)
print(client)  # prints: <Response [200]>

# Parse the response body and print the resulting HTML
soup = bsoup(client.content, features="html.parser")
print(soup.prettify())

The resultant script is also much shorter, and half of it consists only of meta tags.

Do you know why I am unable to scrape the full script as it appears in the page source? I have read posts saying that, because the page consists of dynamic listings generated with JavaScript, it is recommended to use Selenium instead. If so, I would also like to understand why BeautifulSoup is unable to do what Selenium does. Is there any way to scrape what I want without resorting to another library?

Many thanks in advance.

user13982022
  • The answer to your question involves learning how a page is sent from the server to the browser, and what the browser does to display the full page. I'm afraid SO isn't really suited to that kind of tutorial material. – thebjorn Jul 23 '20 at 11:22
  • What exactly do you mean by "script"? If you mean the HTML source, you should probably [edit] to clarify your wording; "script" in this context usually refers to JavaScript code embedded in the HTML source within `<script>` tags. – tripleee Jul 23 '20 at 11:25
  • If you mean literally "explain what Selenium has and BeautifulSoup doesn't", the answer is basically "a full JavaScript engine attached to a web browser with a full DOM". See also https://stackoverflow.com/questions/51117692/beautifulsoup-does-not-see-element-even-though-it-is-present-on-a-page – tripleee Jul 23 '20 at 11:31

1 Answer

Beautiful Soup only parses the HTML it is given; it does not execute JavaScript, so anything a page adds through scripts after loading never appears in what requests downloads. Selenium is a WebDriver tool that drives a real browser, so the page's scripts run and user behavior can be simulated. They are two different tools that can be used together when scraping a web page with dynamic content. See this blog post for help understanding the issues surrounding scraping JavaScript-heavy sites and this SO post for an explanation of how to tie the various tools together.
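As a rough illustration of the difference, here is a minimal sketch. The two HTML strings, the "listing" class name, and the helper names count_listings and fetch_rendered_html are all made up for this example and are not taken from the site in the question. The first string stands in for what requests downloads, the second for what the DOM might look like after a browser has run the script. The Selenium helper is only defined, not called, and assumes the selenium package and a local Chrome install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, roughly what requests would download: an empty
# container plus a script that only a browser would execute.
RAW_HTML = """
<html>
  <head><meta charset="utf-8"><title>Listings</title></head>
  <body>
    <div id="listings"></div>
    <script>
      for (const name of ["Flat A", "Flat B"]) {
        const item = document.createElement("div");
        item.className = "listing";
        item.textContent = name;
        document.getElementById("listings").appendChild(item);
      }
    </script>
  </body>
</html>
"""

# Hypothetical DOM after a browser has run the script above.
RENDERED_HTML = """
<html>
  <body>
    <div id="listings">
      <div class="listing">Flat A</div>
      <div class="listing">Flat B</div>
    </div>
  </body>
</html>
"""


def count_listings(html: str) -> int:
    """Count <div class="listing"> elements in the HTML exactly as given."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_="listing"))


def fetch_rendered_html(url: str) -> str:
    """Load url in Chrome via Selenium and return the DOM as serialized HTML.

    Assumes selenium and Chrome are installed; imported lazily so the
    rest of this sketch runs without them.
    """
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


print(count_listings(RAW_HTML))       # 0: the script was never executed
print(count_listings(RENDERED_HTML))  # 2: the elements exist in the markup
```

On a real page you would pass the string returned by fetch_rendered_html to Beautiful Soup in the same way. Content that loads asynchronously may also need an explicit wait before page_source is read.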

It could also be that the site does not allow scraping and the response is just a message saying so. It is hard to tell without seeing the output.
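One rough way to check for that is to look at the status code and search the body for typical block-page wording. The sketch below is a heuristic only: the phrase list, the chosen status codes, and the function name diagnose_response are assumptions for illustration, not something any particular site is known to use. A 200 status on its own does not rule out a block page.

```python
# Assumed phrases that often show up on block or challenge pages.
BLOCK_HINTS = ("captcha", "access denied", "unusual traffic", "verify you are human")


def diagnose_response(status_code: int, body: str) -> str:
    """Return a rough label for a response based on status code and body text."""
    # 403 (Forbidden) and 429 (Too Many Requests) commonly signal blocking.
    if status_code in (403, 429):
        return "blocked-status"
    lowered = body.lower()
    for hint in BLOCK_HINTS:
        if hint in lowered:
            return "block-page"
    return "no-obvious-block"


print(diagnose_response(200, "<title>Please complete the CAPTCHA</title>"))  # block-page
print(diagnose_response(403, ""))                                            # blocked-status
print(diagnose_response(200, "<div class='listing'>Flat A</div>"))           # no-obvious-block
```

With the code from the question, this would be called as diagnose_response(client.status_code, client.text).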

Nathan