
I am scraping this site and need to extract the salary value from a job page.

I have tried the following:

import requests
from bs4 import BeautifulSoup

# Download the raw HTML of the job page
result = requests.get("https://wuzzuf.net/jobs/p/xGYIYbJlYhsC-Senior-Python-Developer-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3|hpb")
page = result.content
soup = BeautifulSoup(page, "lxml")
# The salary sits in the fourth div with this class
salaries_div = soup.find_all("div", {"class": "css-rcl8e5"})
for span in salaries_div[3].select("span"):
    print(span)

But I only get this one span:

<span class="css-wn0avc">Salary<!-- -->:</span>

Why can't I get all the spans inside the div, and what should I do to get the salary value in this case?

Mhd O.

2 Answers


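Beautiful Soup is only a parser: it works on whatever markup you give it and plays no part in retrieving or rendering the page. The salary value is filled in by JavaScript after the page loads, so it is not in the HTML that requests returns.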
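To illustrate the point, here is a minimal sketch that parses two hand-written snippets. Both are assumptions for illustration: STATIC_HTML stands in for what a plain HTTP response appears to contain (based on the single span I got), and RENDERED_HTML stands in for the same div after the browser has run the page's scripts. The helper name direct_spans is hypothetical, and the built-in "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the markup a plain HTTP response contains
STATIC_HTML = (
    '<div class="css-rcl8e5">'
    '<span class="css-wn0avc">Salary<!-- -->:</span>'
    '</div>'
)

# Hypothetical stand-in for the same div after JavaScript has run
RENDERED_HTML = (
    '<div class="css-rcl8e5">'
    '<span class="css-wn0avc">Salary<!-- -->:</span>'
    '<span class="css-47jx3m"><span class="css-4xky9y">Confidential</span></span>'
    '</div>'
)


def direct_spans(html):
    """Return the spans that are direct children of the salary div."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": "css-rcl8e5"})
    return div.find_all("span", recursive=False)


# The parser can only report what is present in the string it was given
print(len(direct_spans(STATIC_HTML)))    # 1
print(len(direct_spans(RENDERED_HTML)))  # 2
```

The parsing code is identical in both calls; only the input differs, which is why changing how the page is fetched is what fixes the problem.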

The solution that worked in my case was to use Selenium to get the JavaScript-rendered page.

The working code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver_manager downloads a matching chromedriver and returns its path;
# Selenium 4 takes that path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://wuzzuf.net/jobs/p/xGYIYbJlYhsC-Senior-Python-Developer-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3|hpb")

# page_source is the page's HTML as the browser currently sees it
page = driver.page_source
soup = BeautifulSoup(page, "lxml")
salaries_div = soup.find_all("div", {"class": "css-rcl8e5"})
for span in salaries_div[3].select("span"):
    print(span)
Mhd O.

If the content on your page is generated by JavaScript, try Selenium. I think it has all the functionality you need. Your code will then look like this:


### Let's import Selenium!
from selenium.webdriver import Firefox, FirefoxOptions
from selenium.webdriver.common.by import By
### First, we need to tell Selenium not to show a graphical window, so we will use Firefox in headless mode.
### We do so by creating an instance of FirefoxOptions and adding the '-headless' argument
opt = FirefoxOptions()
opt.add_argument("-headless")
### Now we create the actual Firefox instance and pass it our FirefoxOptions as the keyword argument 'options'
ffx = Firefox(options=opt)
### We visit your website with ffx.get()
ffx.get("https://wuzzuf.net/jobs/p/xGYIYbJlYhsC-Senior-Python-Developer-Cairo-Egypt?o=1&l=sp&t=sj&a=python|search-v3|hpb")
### Let's now search for your spans with ffx.find_elements() and a CSS selector
elems = ffx.find_elements(By.CSS_SELECTOR, "div.css-rcl8e5:nth-child(5)>span")
### And print the elements
for elem in elems:
    print(elem.get_attribute('outerHTML'))

This (at least in my case) outputs:

<span class="css-wn0avc">Salary<!-- -->:</span>
<span class="css-47jx3m"><span class="css-4xky9y">Confidential</span></span>

To access the second element, use elems[-1]. elems[-1].get_attribute('outerHTML') returns its HTML source, and elems[-1].text returns just its visible text.
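If you would rather not depend on the element's position, one option is to look for the "Salary" label and take the span that follows it. The sketch below is only an illustration under assumptions: it parses an HTML string with Beautiful Soup (for example the browser's page source), the helper name salary_from_html is hypothetical, and the sample markup is copied from the output above rather than fetched live.

```python
from bs4 import BeautifulSoup


def salary_from_html(html):
    """Return the text next to the 'Salary' label, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    for label in soup.find_all("span"):
        # The label span reads "Salary:" once the HTML comment is ignored
        if label.get_text(strip=True).startswith("Salary"):
            value = label.find_next_sibling("span")
            return value.get_text(strip=True) if value else None
    return None


# Sample markup taken from the output shown above
SAMPLE = (
    '<div class="css-rcl8e5">'
    '<span class="css-wn0avc">Salary<!-- -->:</span>'
    '<span class="css-47jx3m"><span class="css-4xky9y">Confidential</span></span>'
    '</div>'
)

print(salary_from_html(SAMPLE))  # Confidential
```

Matching on the label text should survive the div moving around on the page, although it would still break if the site renamed the label.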

But do not forget to install Selenium with

pip install selenium

You also need Firefox and geckodriver installed.

Adam Jenča
  • Thank you. Try using [webdriver_manager](https://pypi.org/project/webdriver-manager) so you don't need to install geckodriver manually. – Mhd O. Sep 02 '21 at 06:34