
I'm attempting to scrape all the links contained in the boxes of this website, but my code doesn't return anything. What am I doing wrong? Even when I search generally for 'a' tags with href=True, I don't get the links I'm looking for.

import requests
from bs4 import BeautifulSoup

url = 'https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&page=1&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO'
page = requests.get(url)  
soup = BeautifulSoup(page.content, 'lxml')

ahrefs = soup.find_all('a', {'class': 'article-link', 'href': True})
for a in ahrefs:
    print(a.text)
  • What exactly do you want to select? `True` is not a valid value for the hyper-reference attribute. Also note that `href` is a mandatory attribute of a link (without `@href` a link is just a string), so there is no need to *select a link only if it has an `href` attribute* (if that's what you mean) – Andersson Nov 07 '18 at 15:18
  • @Andersson Even if I omit the href (because a string would also be fine), I don't get anything. I would like all the URLs in the blocks. The XPath is `//*[@id="search-results-container"]/div/div[1]/div[10]/article/job/a` and the CSS selector is `#search-results-container > div > div.search-items.ng-scope > div:nth-child(2) > article > job > a` (don't know if that information helps) – Lunalight Nov 07 '18 at 15:22
  • You can't use BeautifulSoup here (dynamic content)... but you could parse this JSON: https://www.nationalevacaturebank.nl/vacature/zoeken.json?query=&location=&distance=city&page=1&limit=100&sort=date&filters[careerLevel][]=Starter&filters[educationLevel][]=MBO – t.m.adam Nov 07 '18 at 15:31
  • @t.m.adam why not? I want to scrape several pages, so I don't think I want to build JSON URLs all the time. – Lunalight Nov 07 '18 at 15:34
  • As I said, the content is dynamic, so you can't get it with requests and BeautifulSoup. You could use Selenium, but even then you wouldn't have to use BeautifulSoup, as Selenium has its own selectors. – t.m.adam Nov 07 '18 at 15:37
  • I'm running my code in another program that uses the python API. The code I wrote in Selenium stopped working, every url returns that the page doesn't exist (it does in the browser) – Lunalight Nov 07 '18 at 15:40
  • @Andersson Yes, 'a' tags would usually have a 'href' attribute, but that's not guaranteed (see https://stackoverflow.com/questions/10510191/valid-to-use-a-anchor-tag-without-href-attribute for example). I think it doesn't hurt to check if links have a 'href', although in most cases that would be redundant. – t.m.adam Nov 07 '18 at 15:58
  • @Lunalight The two URLs are almost identical; the only difference I can see is the 'json' parameter. You could just use the original URL and replace '/vacature/zoeken' with '/vacature/zoeken.json'. – t.m.adam Nov 07 '18 at 16:03
  • @t.m.adam, I didn't mean that a link without `@href` is an *invalid node*; it just makes no sense to me to set `href` on only some of the links of the same class. So it looks like the OP is facing an X-Y problem – Andersson Nov 07 '18 at 16:13
  • @Lunalight, what do you mean by *"The code I wrote in Selenium stopped working, every url returns that the page doesn't exist"*? What is your current output? – Andersson Nov 07 '18 at 16:14
  • @Andersson Yes, that's possible; I see what you mean now. – t.m.adam Nov 07 '18 at 16:20
  • @Andersson I checked with requests and it looks like the website is blocking requests coming from the software I run the script in (requests got a 404). It still works with Jupyter Notebook though. – Lunalight Nov 08 '18 at 09:34
  • @Lunalight A 404 status usually means that the requested resource was not found, but yeah, sometimes it might also mean that the resource does exist but the server doesn't want you to know about its existence... I'm not sure what *"still works with Jupyter Notebook"* means, as the required content is obviously dynamic and cannot be scraped from the shared link... Did you try Bertrand Martel's answer or a Selenium solution with ExplicitWait implemented? – Andersson Nov 08 '18 at 09:44
  • @Andersson The Selenium scraper where I get all the links works in Jupyter but not in Alteryx (the software I use to run my code and use the results for further processing). I first thought the problem was with Selenium, but since I got a 404 with requests in Alteryx, I think the website's server doesn't want me to know the site exists or something. – Lunalight Nov 08 '18 at 10:25
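
A minimal sketch of the Selenium route suggested in the comments, using an explicit wait so the Angular-rendered content has time to appear. This is not from the thread itself: the simplified CSS selector is an assumption based on the selector quoted above, and the driver setup will vary per environment.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://www.nationalevacaturebank.nl/vacature/zoeken'
       '?query=&location=&distance=city&page=1&limit=100&sort=relevance'
       '&filters%5BcareerLevel%5D%5B%5D=Starter'
       '&filters%5BeducationLevel%5D%5B%5D=MBO')

driver = webdriver.Chrome()  # driver setup varies per environment
try:
    driver.get(url)
    # Wait up to 15 seconds for the job anchors to be rendered by Angular;
    # this selector is a simplified guess based on the one in the comments
    anchors = WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, '#search-results-container article a')
        )
    )
    for a in anchors:
        print(a.get_attribute('href'))
finally:
    driver.quit()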

1 Answer


This is an Angular website which loads its content dynamically from an external JSON API. The API is located at https://www.nationalevacaturebank.nl/vacature/zoeken.json and needs a cookie to be set. The following will build the links you wanted to extract:

import requests

r = requests.get(
    'https://www.nationalevacaturebank.nl/vacature/zoeken.json',
    params={
        'query': '',
        'location': '',
        'distance': 'city',
        'page': '1,110',
        'limit': 100,
        'sort': 'date',
        'filters[careerLevel][]': 'Starter',
        'filters[educationLevel][]': 'MBO'
    },
    # the API needs this consent cookie to be set
    headers={
        'Cookie': 'policy=accepted'
    }
)

# build the vacancy links from the job ids in the JSON payload
links = [
    "/vacature/{}/reisspecialist".format(t["id"])
    for t in r.json()['result']['jobs']
]

print(links)

The JSON result also gives you all the card metadata embedded in the page.
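
Since several pages were wanted, here is a minimal sketch of paging through the same JSON endpoint; the page range is a placeholder, and any field other than 'id' under result.jobs is an assumption to verify against a real response.

import requests

session = requests.Session()
session.headers['Cookie'] = 'policy=accepted'  # the API needs this cookie

links = []
for page in range(1, 4):  # placeholder range; substitute the real page count
    r = session.get(
        'https://www.nationalevacaturebank.nl/vacature/zoeken.json',
        params={
            'query': '',
            'location': '',
            'distance': 'city',
            'page': page,
            'limit': 100,
            'sort': 'date',
            'filters[careerLevel][]': 'Starter',
            'filters[educationLevel][]': 'MBO',
        },
    )
    # 'id' comes from the answer above; other job fields are untested
    links += ["/vacature/{}/reisspecialist".format(t["id"])
              for t in r.json()['result']['jobs']]

print(len(links), 'links collected')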

– Bertrand Martel