
Problem

I'm scraping a dynamic page (one that appears to load its results from a database for display), but I seem to be getting only the placeholders for the text elements rather than the text itself. The page I'm loading is: https://www.bricklink.com/v2/search.page?q=8084#T=S

Expected / Actual

Expected:

<table>
  <tr>
    <td class="pspItemClick">
      <a class="pspItemNameLink" href="www.some-url.com">The Name</a>
      <br/>
      <span class="pspItemCateAndNo">
        <span class="blcatList">Catalog Num</span> : 1111
      </span>
    </td>
  </tr>
</table>

Actual

<table>
  <tr>
    <td class="pspItemClick">
      <a class="pspItemNameLink" href="[%catalogUrl%]">[%strItemName%]</a>
      <br/>
      <span class="pspItemCateAndNo">
        <span class="blcatList">[%strCategory%]</span> : [%strItemNo%]
      </span>
    </td>
  </tr>
</table>

Attempted Solutions

  1. I first tried loading the site with the requests library which, of course, didn't work since it's not a static page:
from bs4 import BeautifulSoup
import requests

def load_page(url: str) -> BeautifulSoup:
    headers = {
        # Note: the Access-Control-* entries are CORS *response* headers;
        # they have no effect when sent with a request.
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return BeautifulSoup(req.content, 'html.parser')
  2. I then tried Selenium's webdriver to load the dynamic content:
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

def html_source_from_webdriver(url: str, wait: int = 0) -> BeautifulSoup:
    # Selenium 4 renamed the chrome_options keyword to options
    browser = webdriver.Chrome(service=selenium_chrome_service, options=options)
    browser.implicitly_wait(wait)
    browser.get(urljoin(ROOT_URL, url))

    page_source = browser.page_source
    return BeautifulSoup(page_source, features="html.parser")

Both attempts yield the same result. I haven't used the implicitly_wait feature much, so I was just experimenting with different values (0-15 seconds), none of which worked. I've also tried browser.set_script_timeout(<timeout>), which did not work either.

Any thoughts on where to go from here would be greatly appreciated.

Update

I appreciate those of you providing suggestions. I've also tried the following with no luck:

  • using time.sleep() - added after the browser.get(...) call.
  • using browser.set_page_load_timeout() - didn't expect this one to work, but tried anyway.
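For completeness, an explicit wait that polls until the template tokens are replaced would look roughly like the sketch below. The placeholders_resolved helper is hypothetical (not part of my code); the idea is that BrickLink's search page leaves tokens like [%strItemName%] in the DOM until an XHR fills them in, so the wait condition can simply check for their absence.

```python
def placeholders_resolved(html: str) -> bool:
    """Return True once no [%...%] template tokens remain in the markup."""
    return '[%' not in html

# Sketch of the wait itself (requires selenium; browser as in the question):
#   from selenium.webdriver.support.ui import WebDriverWait
#   WebDriverWait(browser, 15).until(
#       lambda drv: placeholders_resolved(drv.page_source))
```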
James B
  • Does this answer your question? [Make Selenium wait 10 seconds](https://stackoverflow.com/questions/45347675/make-selenium-wait-10-seconds) – Abdul Aziz Barkat May 30 '23 at 15:38
  • `implicitly_wait` is probably not useful for you because the element you are looking for is **already** present (Although you don't seem to be using Selenium for getting the element so it won't matter anyway). You need to explicitly wait a bit for that elements value to be updated. – Abdul Aziz Barkat May 30 '23 at 15:39
  • @AbdulAzizBarkat Thank you. Unfortunately, the answer in the link you provided didn't work for me. As far as the `implicitly_wait`, I was kind of just throwing whatever I could at it. Other dynamic pages seem to load fine, so this one is throwing me off a bit. This is a first go at scraping so I'm doing some trial by fire. – James B May 30 '23 at 15:54

1 Answer


Here is one way of getting that information: start by inspecting the Network tab in your browser's dev tools while the page loads, and look for any calls made to various APIs via XHR or WS:

import requests
import pandas as pd

headers= {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
}

url = 'https://www.bricklink.com/ajax/clone/search/searchproduct.ajax?q=8084&st=0&cond=&type=&cat=&yf=0&yt=0&loc=&reg=0&ca=0&ss=&pmt=&nmp=0&color=-1&min=0&max=0&minqty=0&nosuperlot=1&incomplete=0&showempty=1&rpp=25&pi=1&ci=0'

r = requests.get(url, headers=headers)
df = pd.json_normalize(r.json()['result']['typeList'], record_path = ['items'])
print(df)

Result in terminal:

    idItem  typeItem    strItemNo   strItemName     idColor     idColorImg  cItemImgTypeS   bHasLargeImg    n4NewQty    n4NewSellerCnt  mNewMinPrice    mNewMaxPrice    n4UsedQty   n4UsedSellerCnt     mUsedMinPrice   mUsedMaxPrice   strCategory     strPCC
0   95924   S   66364-1     Star Wars Bundle Pack, Super Pack 3 in 1 (Sets...   -1  0   J   True    3   3   CZK 3,839.17    CZK 4,145.02    0   0   CZK 0.00    CZK 0.00    65.258  None
1   95927   S   66368-1     Star Wars Bundle Pack, Super Pack 3 in 1 (Sets...   -1  0   J   True    7   4   CZK 2,889.67    CZK 3,974.78    0   0   CZK 0.00    CZK 0.00    65.258  None
2   88129   S   8084-1  Snowtrooper Battle Pack     -1  0   G   True    157     51  CZK 473.72  CZK 3,552.88    109     79  CZK 231.09  CZK 884.62  65.258  None
3   95085   C   c09se2  2009 Large Swedish July - December (456.8084-SV)    -1  -1  None    False   0   0   CZK 0.00    CZK 0.00    1   1   CZK 117.24  CZK 117.24  647     None
4   210835  G   SW4AM2  Display Assembled Set, Star Wars Sets 8083, 80...   -1  11  J   True    0   0   CZK 0.00    CZK 0.00    0   0   CZK 0.00    CZK 0.00    848.65.258  None
5   88128   I   8084-1  Snowtrooper Battle Pack     -1  0   J   True    590     74  CZK 0.27    CZK 84.42   605     302     CZK 0.22    CZK 221.21  65.258  None
6   95922   O   66364-1     Star Wars Bundle Pack, Super Pack 3 in 1 (Sets...   -1  0   J   True    0   0   CZK 0.00    CZK 0.00    1   1   CZK 473.72  CZK 473.72  65.258  None
7   95925   O   66368-1     Star Wars Bundle Pack, Super Pack 3 in 1 (Sets...   -1  0   J   True    0   0   CZK 0.00    CZK 0.00    0   0   CZK 0.00    CZK 0.00    65.258  None
8   88127   O   8084-1  Snowtrooper Battle Pack     -1  0   G   True    3   3   CZK 60.55   CZK 236.86  9

See relevant documentation for packages used: pandas and requests
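Judging by the query string above, q carries the search term while pi and rpp look like the page index and results-per-page, so paging through results should just be a matter of incrementing pi. A small helper to build those URLs (the parameter meanings are inferred from the URL, not confirmed by any BrickLink documentation):

```python
from urllib.parse import urlencode

BASE = 'https://www.bricklink.com/ajax/clone/search/searchproduct.ajax'

def search_url(q: str, page: int = 1, per_page: int = 25) -> str:
    """Build the search endpoint URL; 'pi' appears to be the page index
    and 'rpp' the results per page. Other parameters are kept at the
    defaults seen in the query string above."""
    params = {'q': q, 'rpp': per_page, 'pi': page,
              'st': 0, 'cond': '', 'type': '', 'cat': '', 'yf': 0, 'yt': 0,
              'loc': '', 'reg': 0, 'ca': 0, 'ss': '', 'pmt': '', 'nmp': 0,
              'color': -1, 'min': 0, 'max': 0, 'minqty': 0,
              'nosuperlot': 1, 'incomplete': 0, 'showempty': 1, 'ci': 0}
    return f'{BASE}?{urlencode(params)}'

# e.g. requests.get(search_url('8084', page=2), headers=headers).json()
```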

Barry the Platipus