I'm trying to scrape news from https://hk.appledaily.com/search/apple. I need the news content from div class="flex-feature", but my code only returns []. I hope someone can help, thank you!

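from bs4 import BeautifulSoup
import requests

# Download the HTML exactly as the server sends it (no JavaScript is run).
page = requests.get("https://hk.appledaily.com/search/apple")

soup = BeautifulSoup(page.content, 'lxml')

# Look for the containers that should hold the news items.
results = soup.find_all('div', class_="flex-feature")

print(results)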
  • Hey, welcome to Stack Overflow. In the webpage source at your given URL, there seems to be nothing related to `flex-feature`. It looks like what you're searching for is dynamic content, and requests can only grab static content. Consider using another solution such as Selenium. – 0xInfection Oct 05 '20 at 17:06
  • @PinakiMondal see my answer below. If you view just the page source, `flex-feature` is not there because this is the HTML before JavaScript renders the content. If you use Inspect element, however, you will be able to view the dynamic content and `flex-feature` will be there. – Chris Greening Oct 05 '20 at 17:08

2 Answers

If you choose View page source in your browser, you'll see that flex-feature appears nowhere in the HTML. That is the HTML the server initially sends, before the browser runs any JavaScript and fills in the dynamic content. It is also exactly what requests.get receives, which is why find_all returns [].
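A quick way to confirm this diagnosis is to count how often the class occurs in the HTML you actually received. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, the helper names (ClassCounter, count_class) are made up for this example, and the two HTML strings are invented stand-ins for the raw server response and the rendered page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Invented sample: what a JavaScript-driven page might send before rendering.
static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
# Invented sample: the same page after the browser has run its scripts.
rendered_html = '<html><body><div id="root"><div class="flex-feature story">News</div></div></body></html>'

print(count_class(static_html, "flex-feature"))    # 0
print(count_class(rendered_html, "flex-feature"))  # 1
```

Running count_class(page.text, "flex-feature") on the real requests response would tell you the same thing as View page source: a count of 0 means the element is added later by JavaScript.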

To access these elements, you'll likely want something like Selenium, which automates a real browser and lets the JavaScript that loads the page's content run before you read the DOM. Check out my answer to a similar question here for some insight!
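As a rough sketch of the Selenium route, the function below opens the page in headless Chrome, waits for the elements to appear, and returns their text. It assumes Selenium 4 and a local Chrome installation; the function name fetch_rendered_texts, the 15-second timeout, and the "--headless=new" flag (which needs a reasonably recent Chrome) are choices made for this example, not something taken from the question.

```python
def fetch_rendered_texts(url, css_selector, timeout=15):
    """Load `url` in headless Chrome and return the text of matching elements."""
    # Imported here so the module can be loaded even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumption: recent Chrome; drop to see the window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has inserted at least one matching element,
        # or raise TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()


# Example call (needs network access and Chrome, so it is left commented out):
# for text in fetch_rendered_texts("https://hk.appledaily.com/search/apple", ".flex-feature"):
#     print(text)
```

The explicit wait matters: reading the DOM right after driver.get can still return an empty list if the scripts have not finished inserting the elements.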

Chris Greening

The data on that page is fetched and rendered dynamically (via JavaScript), so you won't be able to scrape it unless the JavaScript is evaluated.

One approach to scrape the data would be to use a headless browser.
Here is one such example using pyppeteer.

import asyncio

from pyppeteer import launch  # https://pypi.org/project/pyppeteer/

URL = 'https://hk.appledaily.com/search/apple'


async def main():
    # Launch headless Chromium and open the search page.
    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto(URL)

        # Wait until the JavaScript-rendered elements exist in the DOM.
        await page.waitForSelector('.flex-feature')

        elements = await page.querySelectorAll('.flex-feature')
        for el in elements:
            # Evaluate in the page context to read each element's text.
            text = await page.evaluate('(el) => el.textContent', el)
            print(text)
    finally:
        # Close the browser even if navigation or the wait fails.
        await browser.close()


asyncio.run(main())

Output:

3小時前特朗普確診 不斷更新 特朗普新聞秘書及多名白宮職員確診 「白宮群組」持續擴大特朗普確診 不斷更新

 ... (output truncated) ...
Tibebes. M