
My code successfully scrapes the table with class greygeneraltxt from https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false

However, the site above has multiple pages, and I would like to scrape all the codes on each page (the first column of the table on each page).

For example, at the URL above, clicking the link to page "2" does NOT change the overall URL. I am also not able to find a hidden link for each page; however, I can see the tables for every page when I view the page source. (A sketch that builds on this observation follows my code below.)

It seems quite similar to this: Scrape multiple pages with BeautifulSoup and Python

However, I cannot find the request for the page number under the Network tab.

How can my code be changed to scrape data from all the available listed pages?

My code that works for page 1 only:

import bs4 as bs
import pickle
import requests

def save_hkex_tickers():
  resp = requests.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false')
  soup = bs.BeautifulSoup(resp.text, "lxml")
  table = soup.find('table',{'class':'greygeneraltxt'})
  tickers = []
  for row in table.findAll('tr')[2:]:
    ticker = row.findAll('td')[1].text
    tickers.append(ticker)

  print(tickers)
  return tickers

save_hkex_tickers()
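
Since the question notes that the tables for every page are already present in the HTML source, a minimal sketch of a BeautifulSoup-only route could look like the following. It assumes the other pages' tables share the same class greygeneraltxt and column layout as the first one; the header offset [2:] and column index 1 are copied from the code above and may need adjusting:

import bs4 as bs
import requests

def save_hkex_tickers_all():
  url = ('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents'
         '&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des'
         '%5Een.indexes.hscis.hsci.constituents&retry=false')
  resp = requests.get(url)
  soup = bs.BeautifulSoup(resp.text, "lxml")
  tickers = []
  # find_all instead of find: collect every table with this class,
  # one table per "page" of the listing (an assumption about the markup)
  for table in soup.find_all('table', {'class': 'greygeneraltxt'}):
    for row in table.find_all('tr')[2:]:
      cells = row.find_all('td')
      if len(cells) > 1:  # skip blank/separator rows
        tickers.append(cells[1].text.strip())
  print(tickers)
  return tickers

save_hkex_tickers_all()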
  • You could think of Selenium to get around that. – SIM Oct 15 '17 at 19:18
  • Thanks for your reply, Shahin. I am not really familiar with Selenium. Is it like BeautifulSoup? I am thinking of adding one more for loop inside the loop above, but I am not sure if it can retrieve the first column of every table. – MKYJ Oct 16 '17 at 03:05
  • It is not only Selenium you can go with; anything that can initiate click events can do the trick. – SIM Oct 16 '17 at 07:30
  • Thanks. By the way, is it possible to do it with BeautifulSoup? Sorry, I am not very familiar with Selenium or other tools that can initiate click events. From the link above, I can already see all the first columns' data in the source (Ctrl + U in Chrome). However, I am only able to retrieve the first column of the first page. – MKYJ Oct 16 '17 at 09:17
  • No. BeautifulSoup cannot initiate a click. Btw, if you want to move on to the next page you have no choice other than clicking on this link `javascript:goPage(2);`, or I don't know if another approach is available. – SIM Oct 16 '17 at 10:35 (a rough sketch of this Selenium route follows the comment thread)
  • Hey!! Check out this [Link](https://stackoverflow.com/questions/46773924/scraper-unable-to-get-names-from-next-pages): same question, same website, with a satisfactory solution. – SIM Oct 16 '17 at 16:58
  • Thanks. Would you be able to provide me the script? I just wonder, if all the data are available in the page source, why do we need to click the link to go to the next page? Sorry, I am a beginner in Python. – MKYJ Oct 16 '17 at 17:13
  • Oh I missed your previous comment. Let me take a look on it. – MKYJ Oct 16 '17 at 17:14
  • Thanks Shahin. That link solved all of my questions. – MKYJ Oct 20 '17 at 18:03
  • Possible duplicate of [Scraper unable to get names from next pages](https://stackoverflow.com/questions/46773924/scraper-unable-to-get-names-from-next-pages) – jdoe Oct 27 '17 at 22:48
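
For completeness, here is a rough sketch of the Selenium route suggested in the comments. It assumes chromedriver is available on PATH, that the pager exposes a global goPage(n) JavaScript function (as the javascript:goPage(2); link suggests), and the total page count is hard-coded for illustration:

import time
import bs4 as bs
from selenium import webdriver

def save_hkex_tickers_selenium(n_pages=5):  # page count is a guess; adjust to the real pager
  driver = webdriver.Chrome()  # assumes chromedriver is on PATH
  driver.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents'
             '&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des'
             '%5Een.indexes.hscis.hsci.constituents&retry=false')
  tickers = []
  for page in range(1, n_pages + 1):
    if page > 1:
      # the pager links are javascript:goPage(n); calling goPage directly
      # replaces the visible table with the requested page
      driver.execute_script("goPage(%d);" % page)
      time.sleep(2)  # crude wait for the table to refresh
    soup = bs.BeautifulSoup(driver.page_source, "lxml")
    table = soup.find('table', {'class': 'greygeneraltxt'})
    for row in table.find_all('tr')[2:]:
      cells = row.find_all('td')
      if len(cells) > 1:
        tickers.append(cells[1].text.strip())
  driver.quit()
  return tickers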
