This isn't just a simple "how to retrieve links" question. When I scrape a page, the href attribute returns a relative path such as '/people/4849247002', but if you inspect the page and click the link, it actually goes to 'https://website/people/4849247002'. How can I get the full 'https://website/people/4849247002' URL instead?

A side note: what's the correct way to use BeautifulSoup to fetch and parse a webpage? I've been using both of the following:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)

and

import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')

I'm currently using Python 3.8.

Jorge Morgado
chadlei
  • Which URL exactly are you scraping? Right now you're asking three questions, not one. – baduker Oct 10 '20 at 00:04
  • https://boards.greenhouse.io/adhocexternal this url – chadlei Oct 10 '20 at 00:06
  • Does this answer your question? [Scrape the absolute URL instead of a relative path in python](https://stackoverflow.com/questions/44001007/scrape-the-absolute-url-instead-of-a-relative-path-in-python) – AMC Oct 10 '20 at 00:09
  • this is kinda the same as me just putting the base URL as a string and doing base + href_link, I was hoping for an answer that could find the same link that you would get if you clicked it – chadlei Oct 10 '20 at 00:13

1 Answer


Another method, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc, utils, req

url = 'https://boards.greenhouse.io/adhocexternal'
html = req.get(url)
doc = SimplifiedDoc(html)
print(doc.listA(url).url)  # Print all links on the page as full URLs
# Or select only the job links and make each href absolute
lstA = doc.selects('a@data-mapped=true>href()')
print([utils.absoluteUrl(url, a) for a in lstA])

Result:

['https://adhoc.team/join/', 'https://adhoc.team/blog/', 'https://boards.greenhouse.io/adhocexternal/jobs/4877141002', 'https://boards.greenhouse.io/adhocexternal/jobs/4877155002', 'https://boards.greenhouse.io/adhocexternal/jobs/4869701002', 'https://boards.greenhouse.io/adhocexternal/jobs/4877146002', ...
['https://boards.greenhouse.io/adhocexternal/jobs/4877141002', 'https://boards.greenhouse.io/adhocexternal/jobs/4877155002', ...

Or you can use the framework directly.

from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'greenhouse'
    start_urls = ['https://boards.greenhouse.io/adhocexternal']

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        urls = doc.listA(url.url)
        data = doc.title # Whatever data you want to get
        return {'Urls': urls, 'Data': data}


SimplifiedMain.startThread(MySpider())  # Start download
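
For comparison, here is a minimal standard-library sketch of the same idea: resolving each scraped href against the URL of the page it came from, much as a browser does when a link is clicked. The href values below are hypothetical examples, not taken from the real page; with BeautifulSoup they would typically come from something like `a['href']` for each anchor.

```python
from urllib.parse import urljoin

# URL of the page the hrefs were scraped from (from the question's comments).
page_url = "https://boards.greenhouse.io/adhocexternal"

# Hypothetical hrefs, in the shapes a scraper commonly returns.
hrefs = [
    "/adhocexternal/jobs/4877141002",  # root-relative path
    "https://adhoc.team/join/",        # already absolute, left unchanged
    "jobs/4877155002",                 # relative to the page's directory
]

# urljoin resolves each href against the page URL instead of
# blindly concatenating two strings.
absolute = [urljoin(page_url, href) for href in hrefs]

for link in absolute:
    print(link)
```

The third href shows why this differs from `base + href`: it resolves relative to the directory of the page URL, giving `https://boards.greenhouse.io/jobs/4877155002`, which plain string concatenation would not produce.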
dabingsou
  • This gets me the full URLs and definitely answers the question! Not sure if you could help with this too, but my overall goal is to get URLs only for a specific position. So I would need to search all the divs and collect the URLs inside them ONLY if the job title is the one I'm looking for. – chadlei Oct 10 '20 at 00:52
  • @chadlei Try this: print (doc.listA(url).contains('adhocexternal/jobs/', attr='url').url) – dabingsou Oct 10 '20 at 00:55
  • @chadlei Or: print (doc.listA(url, start='id="filter-count"', end='id="footer"').url) – dabingsou Oct 10 '20 at 00:58
  • How would I implement this if I wanted to put it in my for loop? It looks like this: for div in divs: for link in anchors: if link.text has keywords, then save the URL (insert your simplified_scrapy line here) – chadlei Oct 10 '20 at 00:59
  • Do you mean this?
        lstA = doc.listA(url, start='id="filter-count"', end='id="footer"').url
        for a in lstA:
            print(a)
    – dabingsou Oct 10 '20 at 01:02
  • @chadlei The comment was not loaded completely just now.
        lstA = doc.listA(url, start='id="filter-count"', end='id="footer"')
        for a in lstA:
            if 'UX' in a.title:
                print(a)
    – dabingsou Oct 10 '20 at 01:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/222806/discussion-between-dabingsou-and-chadlei). – dabingsou Oct 10 '20 at 01:12
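
The loop described in the comments (keep a link's URL only when its text contains a keyword) can also be sketched with the standard library alone. This is an illustration under assumptions, not the approach from the answer: `LinkCollector`, `links_matching`, and the sample HTML are hypothetical, and the real page's markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []     # finished (url, text) pairs
        self._href = None   # URL of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve a relative href against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_matching(html, page_url, keyword):
    """Return absolute URLs whose link text contains keyword (case-insensitive)."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    collector.close()
    return [url for url, text in collector.links
            if keyword.lower() in text.lower()]


# Hypothetical markup standing in for the job board page.
sample_html = """
<div class="opening"><a href="/adhocexternal/jobs/1">UX Designer</a></div>
<div class="opening"><a href="/adhocexternal/jobs/2">Data Engineer</a></div>
"""

matches = links_matching(
    sample_html, "https://boards.greenhouse.io/adhocexternal", "UX")
print(matches)
```

With the sample markup above, only the "UX Designer" link is kept, and its href is returned as a full URL.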