I am new to web scraping and am trying to scrape the following website: https://www.epri.com/#/careers/list

I am using Python. I have tried requests, PhantomJS, and Selenium with chromedriver to get the HTML, but the HTML I get does not match what I see when inspecting the page in Google Chrome.

I have minimal knowledge of HTML and almost no knowledge of JavaScript. My main problem is getting the HTML I see in Google Chrome so that I can start scraping.

Thanks in advance!

Shiv Kumar
BGuha
  • try this: [Python Selenium accessing HTML source](https://stackoverflow.com/questions/7861775/python-selenium-accessing-html-source) – Advay Umare Feb 01 '18 at 05:25
  • Read https://www.dataquest.io/blog/web-scraping-tutorial-python/ . That'll give you an idea about web scraping (with Python). – Sharad Feb 01 '18 at 05:33
  • Not all websites have static HTML content, which is probably what you are after. On that website, some parts look generated and others are probably CSS. Try this question: https://stackoverflow.com/questions/8323728/scraping-dynamic-content-in-a-website – smac89 Feb 01 '18 at 05:35
  • Why not use Beautiful Soup? – Vikas Periyadath Feb 01 '18 at 05:40
  • The page seems to make a request for `https://services.epri.com/api/page-data/reqs` which is the JSON that fills in the table of open positions. – Dan D. Feb 02 '18 at 08:19
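
Following up on the last comment, here is a minimal sketch of reading that endpoint directly instead of rendering the page. It assumes the endpoint is still live and returns JSON without authentication, neither of which is confirmed here. `fetch_json` and its `opener` parameter are illustrative names, and the layout of the returned data is unknown, so the sketch only loads it.

```python
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen):
    """Fetch `url` and decode the response body as JSON.

    `opener` is injectable so the function can be tried without network
    access; by default it is urllib.request.urlopen.
    """
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (URL from the comment above; assumed to still be reachable):
# data = fetch_json("https://services.epri.com/api/page-data/reqs")
# print(type(data))  # inspect the structure before relying on any field
```

If this works, no browser automation is needed for the job list, since the table in the page is filled from this response.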

3 Answers

The first thing you should look for is a DOM parser. These let you treat DOM elements (like <body>, <head>, <img>, etc.) as Python objects. Search for "Python DOM parser" to find one.

After that, write a program that downloads the whole HTML and then uses the DOM parser to pull out the information you need. If you need to scrape several pages, such as a long list of links, store the URLs in a list, fetch the HTML for each one, and repeat the process.

This way you can get most of the information from almost any site. What remains is to reverse engineer how each site exposes the data you want.
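
As a rough illustration of the parse-then-collect-links step, here is a sketch that uses the standard library's `html.parser` as a stand-in, since the answer does not name a specific parser. `LinkCollector`, `extract_links`, and the sample HTML are made up for the example. It only parses HTML you already have, so for a JavaScript-rendered page the HTML must come from a rendered source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in `html` as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


sample = '<body><a href="/jobs/1">One</a><a href="https://example.com/x">Two</a></body>'
print(extract_links(sample, "https://example.com/careers/"))
# ['https://example.com/jobs/1', 'https://example.com/x']
```

The returned list can then be looped over to fetch and parse each linked page in turn.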

Tom Piaggio

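urllib2 (urllib.request in Python 3) works well for fetching the raw HTML, and it is easy to use.

import urllib.request

URL = 'https://www.epri.com/#/careers/list'
# Download the page as the server sends it (no JavaScript is executed)
response = urllib.request.urlopen(URL)
print("Output: \n\n\n\n", response.read().decode("utf-8"))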

For parsing the obtained HTML, you can use BeautifulSoup.

  • This doesn't work the way I want it to. The HTML it returns is not the HTML I see in Google Chrome, because the page's HTML is dynamically generated. – BGuha Feb 03 '18 at 01:16

You can use pyquery, which lets you run jQuery-style queries on XML documents.

pigletfly