Scrape currently opened webpage or get live HTML with another method?

Question

I need to get a bit of data from an HTML tag that only appears when you're signed in to a site. I need to do it in either Python or JavaScript. In JavaScript, the browser's same-origin policy (the restriction that CORS headers relax) is an obstacle.

I can't use server-side code. I can't use iframes.

The data is readily available if you open the page URL in Chrome or Firefox, because the browser keeps you signed in, much like it does with Facebook, so I'll use Facebook as an example. Say I want to get the data from the first element of my Facebook news feed.

I've tried scraping the webpage with Python's urllib module, passing in a User-Agent value. I've also tried Yahoo's YQL tool from JavaScript. Both returned the HTML, but without the values I need in it. That is because neither request goes through my browser, which holds the cookies required to populate those values.

So is there a way to scrape a webpage that's already open? Say I had Facebook open and I ran some code that got my news feed data from the browser.

Is there some other method I haven't mentioned to accomplish this?

Background: I'm creating an autobumper for a forum (within the site rules) and need some generated values from the site's HTML, but I will get no cooperation from the owner towards that end.

It's entirely possible for server-side code to support cookies and thus multi-page sessions including a login flow. — ceejayoz, Oct 30 '16 at 02:07
@ceejayoz If I absolutely cannot do it how I described I might resort to something like that. How might that be done? — user3055938, Oct 30 '16 at 02:10
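
A rough idea of what ceejayoz's comment describes, using only the Python standard library. This is a minimal sketch, not a working login flow: the helper name `build_session_opener` and the User-Agent string are made up for illustration, and the site's login URL and form fields are unknown, so they appear only as placeholders in the comments.

```python
import http.cookiejar
import urllib.request


def build_session_opener(user_agent="Mozilla/5.0 (example)"):
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Hypothetical User-Agent value; replace with whatever the site expects.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = build_session_opener()
# Every request made through `opener` shares `jar`, so a cookie set by a
# login response is sent back automatically on later requests, e.g.:
#   opener.open(login_url, data=encoded_form)   # hypothetical URL and form
#   html = opener.open(page_url).read()         # sent with the stored cookies
```

A plain `urllib.request.urlopen` call keeps no cookies between requests, which is consistent with the question's result: the HTML comes back, but without the signed-in values.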

score 0 · Answer 1 · answered Oct 30 '16 at 03:51

You can try Selenium WebDriver for Python, which lets you log in and then read the page's HTML source.

First run pip install selenium, then download chromedriver.exe from the Selenium website: http://docs.seleniumhq.org/

Here is some sample code I use on Gmail:

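from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the chromedriver executable downloaded from the Selenium site
chromedriver_path = r'your chromedriver.exe path here'

# Create the WebDriver object and open the login page
# (Selenium 4 API; the older find_element_by_* helpers were removed)
driver = webdriver.Chrome(service=Service(chromedriver_path))
driver.implicitly_wait(1)  # wait up to 1 second for elements to appear
driver.get('https://www.google.com/gmail')

# Log in; these selectors depend on the login page's current markup
driver.find_element(By.CSS_SELECTOR, '#Email').send_keys('email@gmail.com')
driver.find_element(By.CSS_SELECTOR, '#next').click()
driver.find_element(By.CSS_SELECTOR, '#Passwd').send_keys('1234')
driver.find_element(By.CSS_SELECTOR, '#signIn').click()

# Get the HTML of the page as currently rendered
html = driver.page_source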

answered Oct 30 '16 at 03:51

foonspeed


That's pretty awesome. However, it requires me to log the user in, which I don't want to handle, and there's also two-factor authentication, so I don't know if it would work anyway. Is there a way to do that, but with a browser that has its cookies intact? That way a user can log in to their browser and not put their details into the program. Or does it not require me to log the user in at all? It sounds like it might use Google Chrome, maybe through the driver. – user3055938 Oct 31 '16 at 15:01
I am not very familiar with this, but you can try using Selenium to open the URL and then complete the login and two-factor authentication manually. After that you can try saving the cookies. – foonspeed Nov 01 '16 at 15:22
[link](http://stackoverflow.com/questions/15058462/how-to-save-and-load-cookies-using-python-selenium-webdriver) – foonspeed Nov 01 '16 at 15:23
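
The save-and-reload idea from these comments could look roughly like the sketch below. The helper names `save_cookies` and `load_cookies` are invented for illustration; `driver.get_cookies()` and `driver.add_cookie()` are existing Selenium WebDriver methods. The sketch stores the cookies as JSON, and assumes the user has already logged in by hand (two-factor step included) in the Selenium-controlled window before saving.

```python
import json


def save_cookies(driver, path):
    """Write the driver's current cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(driver.get_cookies(), fh)


def load_cookies(driver, path):
    """Add cookies from a JSON file to the driver; return how many were added."""
    with open(path, encoding="utf-8") as fh:
        cookies = json.load(fh)
    for cookie in cookies:
        # Selenium only accepts cookies for the domain of the page that is
        # currently loaded, so open the site with driver.get(...) first.
        driver.add_cookie(cookie)
    return len(cookies)
```

On the first run you would open the site, log in manually, and call `save_cookies(driver, "cookies.json")` (the file name is arbitrary). On later runs, open the site, call `load_cookies(driver, "cookies.json")`, refresh the page, and then read `driver.page_source`. Whether the site accepts the reloaded cookies, and for how long, depends on the site.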
