6

Scraping links should be a simple feat, usually just grabbing the href value of the a tag.

I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of each item's a tag cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:

all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))

However, I can't seem to retrieve any href or onclick attributes, and I'm wondering if this is even possible. I also noticed that I couldn't right-click an item and open the link in a new tab.

Are there any ways around getting the links of all these items?

Edit: Are there any ways to retrieve all the links of the items on the pages?

i.e.

https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...

Edit: Adding an image of one such anchor tag for better clarity: [image]

Max

2 Answers

3

Reverse-engineering the JavaScript that takes you to the promotions pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the HappeningID of the first promotion:

window.__NUXT__.state.Promotion.promotions[0].HappeningID

Based on that, you can create a Python loop to get all the promotions:

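base = "https://sunteccity.com.sg/promotions/"
# Read the promotion data that the Nuxt app keeps on the window object
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    # Each promotion URL is the base path followed by its HappeningID
    happening_id = str(item["HappeningID"])
    print(base + happening_id)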

That generated the following output:

https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
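
If the state is read before the page has finished initializing, the script may come back empty, which could explain intermittent results. Below is a minimal sketch, not a definitive implementation, that polls until the promotions list is available and then builds the URLs. The helper names `wait_for_promotions` and `promotion_urls`, the 10-second timeout, and the 0.5-second poll interval are all assumptions made for illustration; the only thing required of `driver` is an `execute_script` method, as on a Selenium WebDriver.

```python
import time

BASE_URL = "https://sunteccity.com.sg/promotions/"


def wait_for_promotions(driver, timeout=10.0, poll=0.5):
    """Poll until window.__NUXT__ exposes a non-empty promotions list.

    Hypothetical helper: `driver` only needs an execute_script() method.
    """
    # Guard every level so a half-initialized page returns null, not an error
    script = (
        "const n = window.__NUXT__;"
        "return (n && n.state && n.state.Promotion"
        " && n.state.Promotion.promotions) || null;"
    )
    deadline = time.monotonic() + timeout
    while True:
        promotions = driver.execute_script(script)
        if promotions:
            return promotions
        if time.monotonic() >= deadline:
            raise TimeoutError("promotions not available after %.1fs" % timeout)
        time.sleep(poll)


def promotion_urls(promotions, base=BASE_URL):
    """Build one URL per promotion from its HappeningID."""
    return [base + str(promo["HappeningID"]) for promo in promotions]
```

With a real driver, this would be used as `for url in promotion_urls(wait_for_promotions(driver)): print(url)` after loading the promotions page.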
Michael Mintz
  • Just awesome... I did explore that JS file but couldn't find the HappeningID you mentioned. Will check again tomorrow. – Gurmanjot Singh Jan 15 '22 at 20:53
  • Hey Michael, thanks for your answer. I'm curious whether you have tried it in headless browser mode, `options.add_argument('headless')`, and whether it still works for you. It seems to return the links only about half the time (I tried running it about 10 times). – Max Jan 16 '22 at 02:35
  • Hi Max, it's working for me in `headless` mode too. Maybe you need a delay of a few seconds between the page load and the loop. I'm running everything with the [SeleniumBase](https://github.com/seleniumbase/SeleniumBase) framework, in case there's a difference. The SeleniumBase headless mode adds other command-line options to make things run cleaner and to avoid bot detection. – Michael Mintz Jan 16 '22 at 03:05
0

You are using the wrong locator; it returns a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code will be:

all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))

You can also get the desired links directly with the locator '.collections-page .thumb-img a', so your code could be:

links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))
Prophet
  • I don't think this returns the results I am looking for, because there is no `href` attribute on the a tags... – Max Jan 16 '22 at 02:36
  • Well... I'm sorry, I see now. There are no links inside the web elements; they contain only the images. I see Michael's solution above. It's interesting, but it's done with JavaScript reverse engineering, not with Selenium locators. It looks like the links are generated by JavaScript only after clicking on the elements. – Prophet Jan 16 '22 at 07:46