
I'm attempting to scrape all the links contained in the boxes of this website, but my code doesn't return anything. What am I doing wrong? Even when I search generally for 'a' tags with href=True, I don't get the links I'm looking for.

import requests
from bs4 import BeautifulSoup

url = 'https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&page=1&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO'
page = requests.get(url)  
soup = BeautifulSoup(page.content, 'lxml')

ahrefs = soup.find_all('a', {'class': 'article-link', 'href': True})
for a in ahrefs:
    print(a.text)
  • What exactly do you want to select? `True` is not a valid value for the hyper-reference attribute. Also note that `href` is a mandatory attribute of a link (without `@href` a link is just a string), so there is no need to *select a link only if it has an `href` attribute* (if that's what you mean) – Andersson Nov 07 '18 at 15:18
  • @Andersson Even if I omit the href (because a string would also be fine), I don't get anything. I would like all the URLs in the blocks. The XPath is `//*[@id="search-results-container"]/div/div[1]/div[10]/article/job/a` and the CSS selector is `#search-results-container > div > div.search-items.ng-scope > div:nth-child(2) > article > job > a` (don't know if that information helps) – Lunalight Nov 07 '18 at 15:22
  • You can't use BeautifulSoup here (dynamic content)... but you could parse this JSON: https://www.nationalevacaturebank.nl/vacature/zoeken.json?query=&location=&distance=city&page=1&limit=100&sort=date&filters[careerLevel][]=Starter&filters[educationLevel][]=MBO – t.m.adam Nov 07 '18 at 15:31
  • @t.m.adam why not? I want to scrape several pages, so I don't think I want to build JSON URLs all the time. – Lunalight Nov 07 '18 at 15:34
  • As I said, the content is dynamic, so you can't get it with requests and BeautifulSoup. You could use Selenium, but even then you wouldn't have to use BeautifulSoup, as Selenium has its own selectors. – t.m.adam Nov 07 '18 at 15:37
  • I'm running my code in another program that uses the python API. The code I wrote in Selenium stopped working, every url returns that the page doesn't exist (it does in the browser) – Lunalight Nov 07 '18 at 15:40
  • @Andersson Yes, 'a' tags would usually have a 'href' attribute, but that's not guaranteed (see https://stackoverflow.com/questions/10510191/valid-to-use-a-anchor-tag-without-href-attribute for example). I think it doesn't hurt to check if links have a 'href', although in most cases that would be redundant. – t.m.adam Nov 07 '18 at 15:58
  • @Lunalight The two URLs are almost identical; the only difference I can see is the 'json' parameter. You could just use the original URL and replace '/vacature/zoeken' with '/vacature/zoeken.json'. – t.m.adam Nov 07 '18 at 16:03
  • @t.m.adam, I didn't mean that a link without `@href` is an *invalid node*; it just makes no sense to me to set `href` on only some of the links of the same class. So it looks like the OP is facing an X-Y problem – Andersson Nov 07 '18 at 16:13
  • @Lunalight, what do you mean by *"The code I wrote in Selenium stopped working, every url returns that the page doesn't exist"*? What is your current output? – Andersson Nov 07 '18 at 16:14
  • @Andersson Yes, that's possible; I see what you mean now. – t.m.adam Nov 07 '18 at 16:20
  • @Andersson I checked with requests and it looks like the website is blocking requests coming from the software I run the script in (requests got a 404). It still works with Jupyter Notebook though. – Lunalight Nov 08 '18 at 09:34
  • @Lunalight A 404 status usually means that the requested resource was not found, but yeah, sometimes it might also mean that the resource does exist but the server doesn't want you to know about its existence... I'm not sure what *"still works with Jupyter Notebook"* means, as the required content is obviously dynamic and cannot be scraped from the shared link... Did you try Bertrand Martel's answer or a Selenium solution with ExplicitWait implemented? – Andersson Nov 08 '18 at 09:44
  • @Andersson The Selenium scraper where I get all the links works in Jupyter but not in Alteryx (the software I use to run my code and use the results for further processing). I first thought the problem was with Selenium, but since I got a 404 with requests in Alteryx, I think the website's server doesn't want me to know the site exists or something. – Lunalight Nov 08 '18 at 10:25
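
A minimal sketch of the Selenium route suggested in the comments, using an explicit wait so the Angular-rendered content has time to appear. This is not from the thread itself: the simplified CSS selector is an assumption based on the selector quoted above, and the driver setup will vary per environment.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www.nationalevacaturebank.nl/vacature/zoeken'
       '?query=&location=&distance=city&page=1&limit=100&sort=relevance'
       '&filters%5BcareerLevel%5D%5B%5D=Starter'
       '&filters%5BeducationLevel%5D%5B%5D=MBO')

driver = webdriver.Chrome()  # driver setup varies per environment
try:
    driver.get(url)
    # Wait up to 15 seconds for the job anchors to be rendered by Angular;
    # this selector is a simplified guess based on the one in the comments
    anchors = WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, '#search-results-container article a')
        )
    )
    for a in anchors:
        print(a.get_attribute('href'))
finally:
    driver.quit()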

1 Answer


This is an Angular website which loads its content dynamically from an external JSON API. The API is located at https://www.nationalevacaturebank.nl/vacature/zoeken.json and needs a cookie to be set. The following will build the links you wanted to extract:

import requests

r = requests.get(
    'https://www.nationalevacaturebank.nl/vacature/zoeken.json',
    params={
        'query': '',
        'location': '',
        'distance': 'city',
        'page': '1,110',
        'limit': 100,
        'sort': 'date',
        'filters[careerLevel][]': 'Starter',
        'filters[educationLevel][]': 'MBO'
    },
    # the API needs this consent cookie to be set
    headers={
        'Cookie': 'policy=accepted'
    }
)

# build the vacancy links from the job ids in the JSON payload
links = [
    "/vacature/{}/reisspecialist".format(t["id"])
    for t in r.json()['result']['jobs']
]

print(links)

The JSON result also gives you all the card metadata embedded in the page.
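
Since several pages were wanted, here is a minimal sketch of paging through the same JSON endpoint; the page range is a placeholder, and any field other than 'id' under result.jobs is an assumption to verify against a real response.

import requests

session = requests.Session()
session.headers['Cookie'] = 'policy=accepted'  # the API needs this cookie

links = []
for page in range(1, 4):  # placeholder range; substitute the real page count
    r = session.get(
        'https://www.nationalevacaturebank.nl/vacature/zoeken.json',
        params={
            'query': '',
            'location': '',
            'distance': 'city',
            'page': page,
            'limit': 100,
            'sort': 'date',
            'filters[careerLevel][]': 'Starter',
            'filters[educationLevel][]': 'MBO',
        },
    )
    # 'id' comes from the answer above; other job fields are untested
    links += ["/vacature/{}/reisspecialist".format(t["id"])
              for t in r.json()['result']['jobs']]

print(len(links), 'links collected')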

– Bertrand Martel