0

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.


However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.

My questions are:

  1. What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?

  2. If the value doesn't appear in the page source, what can I do to reach it?

user2901181

3 Answers

2

If the content doesn't appear in the page source, it is probably generated with JavaScript. For example, the site might expose a REST API that lists jobs; the JavaScript code could request the jobs from that API, then use the response to create a node in the DOM and attach it to the "Available Jobs" tab. That's just one possibility.

One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if it uses a simple REST API, you just need to make a request to that same URL). Often that is not so easy, so another option is to do your scraping with a JavaScript-capable browser, driven by a tool like Selenium.
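A rough sketch of the "request the same URL" route. The endpoint and the `{"jobs": [...]}` response shape are made up for illustration; the real URL and schema have to come from watching the page's requests in the browser's developer tools. It uses only the standard library, but `requests.get(url, timeout=10).json()` would do the same job.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace it with the URL the page's script really calls.
JOBS_API_URL = "https://example.com/api/jobs"


def count_jobs(payload: str) -> int:
    """Return the number of jobs in a JSON response body.

    Assumes the body looks like {"jobs": [...]}; adjust to the real schema.
    """
    data = json.loads(payload)
    return len(data["jobs"])


def fetch_job_count(url: str = JOBS_API_URL, timeout: float = 10.0) -> int:
    """Request the same URL the page's JavaScript calls and count the jobs."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return count_jobs(body)
```

Calling `fetch_job_count()` against the real endpoint skips the HTML entirely, which is usually faster and more stable than rendering the page.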

One final thing I want to mention: regular expressions are a fragile way to parse HTML, so you should generally prefer a library like BeautifulSoup.
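To illustrate parsing with an HTML parser instead of a regex, here is a small sketch. It swaps in the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs with no extra installs; with BeautifulSoup, `soup.find("span", id=...)` would replace most of this. The `job-count` id and both HTML snippets are invented for the example.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text inside the first <span> that has a given id."""

    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self.depth = 0       # > 0 while we are inside the target span
        self.found = False   # True once the target span has been seen
        self.chunks = []     # text pieces collected from the target span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # a nested span inside the target
        elif not self.found and dict(attrs).get("id") == self.span_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def span_text(html, span_id):
    """Return the stripped text of the span with this id, or '' if empty."""
    parser = SpanTextParser(span_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Invented snippets: what "view source" shows vs. what the inspector shows.
RAW_SOURCE = '<li><span id="job-count"></span> Available Jobs</li>'
RENDERED_DOM = '<li><span id="job-count">128</span> Available Jobs</li>'
```

Running `span_text` on the two snippets shows the problem from the question: the raw source gives an empty string, and only the rendered DOM contains the number.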

Community
Trevor Merrifield
0

1. A value can be loaded dynamically with Ajax. Ajax requests run asynchronously, meaning the rest of the page does not wait for them to finish; that's why the elements they fill in do not appear when you fetch the raw HTML source.

2. To scrape dynamic content, you should use Selenium; here is a tutorial.
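A minimal Selenium sketch of point 2, assuming Selenium 4 and a local Chrome install. The URL and CSS selector are placeholders supplied by the caller, and `parse_count` is just an illustrative helper. The key step is waiting until the span actually has text, since the Ajax call finishes some time after the page loads.

```python
def parse_count(text: str) -> int:
    """Turn a label such as '1,234' into an integer."""
    return int(text.replace(",", "").strip())


def scrape_job_count(url: str, css_selector: str, timeout: float = 10.0) -> int:
    """Load the page in headless Chrome and read the number once it appears."""
    # Imported here so parse_count stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    # Recent Chrome versions; older ones may need plain "--headless".
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until the element exists and its text is non-empty,
        # i.e. until the page's script has filled in the span.
        text = WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.CSS_SELECTOR, css_selector).text.strip()
        )
        return parse_count(text)
    finally:
        driver.quit()
```

Usage would look something like `scrape_job_count("https://example.com/", "span#job-count")`, with both arguments replaced by the real page and selector.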

arcegk
0
  1. For data that loads dynamically, look for the XHR request in the Network tab of your browser's developer tools; if that response contains the data you need, you can request it directly.
  2. You can use PhantomJS; it's a headless browser, and it captures the HTML of the page including the dynamically loaded content.
blackmamba