
I want to use JavaScript to get the data in a webpage.

In the first image, there is a highlighted link to click.

Clicking it opens the webpage shown in the second image.

The desired data is highlighted.

I can already get the data using requests and BeautifulSoup:

The data in the second image is retrieved by JavaScript from somewhere before it is displayed to us.

How can I get the data using JavaScript?

import requests
from bs4 import BeautifulSoup

fig1_url = r'https://huangshigongyuanzy.fang.com/'
fig2_url = r'https://huangshigongyuanzy.fang.com/house/2612049076/fangjia.htm'

headers = {'user-agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
resp = requests.get(fig2_url, headers=headers)
resp.encoding = 'GB18030'  # the page is served in a Chinese encoding

soup = BeautifulSoup(resp.text, 'lxml')
for i in soup.find('div', {'id': 'priceListOpen'}).findAll('tr'):
    for j in i.findAll('td'):
        print(j.text + '|', end=' ')
    print('\n' + '-' * 50)

You can run the snippet here.
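One quick way to tell whether a given value is present in the static HTML that requests receives, or is injected later by JavaScript, is to look for the target markup in the raw response. A stdlib-only sketch (the sample string is a stand-in for `resp.text`, not the real page content):

```python
import re

# stand-in for resp.text; the real markup would come from requests.get(fig2_url)
static_html = '<div id="priceListOpen"><table></table></div><script>loadPrices()</script>'

# the container div exists, but there are no <tr> rows in the static markup,
# which suggests the rows are filled in client-side by JavaScript
has_container = 'id="priceListOpen"' in static_html
has_rows = bool(re.search(r'<tr\b', static_html))
print(has_container, has_rows)
```

If the container is there but the rows are not, requests alone will never see the data, and you need either the underlying endpoint the JavaScript calls or a browser that executes it.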

Chan
    Please include the code in your question itself. – jhpratt Mar 01 '19 at 02:38
  • I feel the page is rendered using JavaScript, which I would recommend handling with a headless browser (Chrome/PhantomJS). You may follow the tutorial here: https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/ – Kondasamy Jayaraman Mar 01 '19 at 03:00
  • `Selenium` requires `fig2_url`, which may be disabled. – Chan Mar 01 '19 at 03:50

2 Answers


I opened the fig2_url and found that this page uses server rendering, so you have to use some tool to crawl the data from it.

This article provides a thorough tutorial.

When you crawl a website, the first thing to look for is its terms of use. Some sites explicitly address whether scraping is allowed. Always check these before you start.
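Beyond the terms of use, you can also check the site's robots.txt programmatically before crawling. A minimal sketch using only the standard library (the robots.txt content below is hypothetical, for illustration; the real file would live at the site root):

```python
from urllib import robotparser

# hypothetical robots.txt content, for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# a path not covered by any Disallow rule is allowed; /private/ is blocked
allowed = rp.can_fetch('*', 'https://example.com/house/2612049076/fangjia.htm')
blocked = rp.can_fetch('*', 'https://example.com/private/data')
print(allowed, blocked)
```

In practice you would point `RobotFileParser` at the live file with `set_url(...)` and `read()` instead of parsing an inline string.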

Vu Luu

Have you looked into using Selenium WebDriver? I was looking into it for another project and it looked promising. It actually invokes your browser, so it should execute the JavaScript for you (I think):

from selenium import webdriver
from bs4 import BeautifulSoup

fig2_url = r'https://huangshigongyuanzy.fang.com/house/2612049076/fangjia.htm'
driver = webdriver.Firefox()  # requires geckodriver on your PATH
driver.get(fig2_url)

# driver.page_source is already a decoded str, so no re-encoding is needed
soup = BeautifulSoup(driver.page_source, 'lxml')
for i in soup.find('div', {'id': 'priceListOpen'}).findAll('tr'):
    for j in i.findAll('td'):
        print(j.text + '|', end=' ')
    print('\n' + '-' * 50)

Seems to work.

Caveats:

  1. It relies on Mozilla's geckodriver, available on GitHub.

  2. You can't fake the headers as you can with requests; but since it actually opens the page in Firefox, you probably don't need to.

  3. At least with the above code, it opens a browser and a geckodriver window; there may be some way to suppress that, but I haven't looked into it. This was just a bare-bones attempt. [edit: the question How to hide Firefox window (Selenium WebDriver)? describes how to get around this with PhantomJS. I haven't tried it.]
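For what it's worth, recent Selenium versions can run Firefox without opening a window via the `-headless` flag. A hedged sketch of the configuration (assumes geckodriver is installed; I haven't verified this against every Selenium release, so check your version's docs):

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.add_argument('-headless')  # run Firefox without opening a visible window

driver = webdriver.Firefox(options=opts)
driver.get('https://huangshigongyuanzy.fang.com/house/2612049076/fangjia.htm')
html = driver.page_source
driver.quit()
```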

codingatty
  • Thanks codingatty. `Selenium` cannot actually solve my problem, as `fig2_url` may be disabled. I want to pass some parameters to run the JavaScript and get the results. – Chan Mar 01 '19 at 04:30