0

I'm implementing a Python script based mainly on pyautogui. One of the things the script does is open a web page in Chrome. After that, I need to access the DOM of that open page. Since I didn't open the browser with Selenium, I can't use Selenium to analyze the DOM. So my question is: is the currently open Chrome page saved somewhere on the hard drive, for example as an .html file, so that I can access it with Selenium? I checked many other questions here where users talk about the Chrome cache, but there are no HTML files there. I only need the currently open page, not all the historical data in the cache. Opening the browser directly with Selenium is not an option either, since most of the websites I analyze use captchas and Distil technology. Thanks.

Angelo
  • Ummm... what do you need exactly? Just the DOM, as HTML? You can get the source with `webdriver.page_source`, which you can save later. This will contain the currently rendered HTML (basically the same that you get with CTRL-U in any browser), but not the external resources (no css/js/pic/whatever files, just references to them) – skandigraun Sep 21 '18 at 16:28
  • @skandigraun I do need the DOM, but remember I don't open the Chrome browser with Selenium. – Angelo Sep 21 '18 at 16:39
  • Why is this a Selenium question if you don't want to use Selenium? And no, modern browsers do not save the current page on disk. But since it's unclear what sort of access you have to the browser (other than not using Selenium), I'm not sure we can help with so little detail. – timbre timbre Sep 21 '18 at 16:49
  • @KirilS. Thanks Kiril. So let's go through this example. You have a website (purely as an example) like streeteasy.com. As soon as you try to access it with Selenium, Distil Networks will realize you're using Selenium and you won't get far. A workaround, in my opinion, is to open the browser without Selenium, move to the desired page with pyautogui, and then access the content of the page, perhaps using Selenium. But how can Selenium access the content of a web page not opened with Selenium? That's why I tagged Selenium as a topic. – Angelo Sep 21 '18 at 16:59
  • Ahh, soon enough I will learn to read :D. Actually it is possible. If you start the original Chrome with the `--remote-debugging-port=PORT_NR` argument and then visit http://localhost:PORT_NR from another browser, you will get access to all content of the original browser, including the developer console. There is also some API documentation available (though very sparse); search for `chrome devtools protocol`. However, if you visit the port with Selenium, you should have a relatively easy time doing stuff. – skandigraun Sep 21 '18 at 17:06
  • take a look at this: https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver some sites are really primitive about selenium detection, so you can override it. Otherwise what skandigraun said, or using some add-on. I don't think you can connect to existing browser instance with selenium, or if you could, that would probably still trigger that site's defence... – timbre timbre Sep 21 '18 at 17:15
  • @skandigraun thank you! I'm already checking online and doing some research, but you gave me a great clue. Apologies for my current ignorance: I tried to run the following Python code but it gives an error message: subprocess.call([r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe --remote-debugging-port=PORT_NR"]) --> error: [WinError 2] The system cannot find the file specified. Any chance you can tell what's wrong? – Angelo Sep 21 '18 at 17:17
  • @KirilS. Thank you Kiril. I read that page about 50 times :) and that's why I figured trying to beat Distil is probably too big of a challenge for me. So I was trying to find a way to locate elements on a webpage that I don't open with Selenium :( – Angelo Sep 21 '18 at 17:19
  • Call it like `subprocess.call(["CHROME_EXECUTABLE_PATH", "--remote-debugging-port=9222"])` - the argument should be a separate list element, and PORT_NR should be an actual number. Then visit localhost:9222 with a new browser. – skandigraun Sep 21 '18 at 17:22
  • @skandigraun thank you so much, really appreciated. I did manage to launch subprocess.call successfully. I opened, for example, streeteasy.com there. If you had an example of how I could retrieve any simple content at this point using 'http://localhost:9222/json', that would really make my day. Thanks again for all your help – Angelo Sep 21 '18 at 17:32
  • If you go without Selenium, you will need a websocket connection. Issue a GET request to /json, and grab the websocket url you need (webSocketDebuggerUrl). Open a websocket connection, and send a command, like `{"method": "DOM.getDocument", "id":12}`. The id is required, can be any integer. Also, has to be a valid JSON. Find available DOM related methods here: https://chromedevtools.github.io/devtools-protocol/1-3/DOM (If it helps at least a bit, will create an answer before the mods kill us for the million comments...) – skandigraun Sep 21 '18 at 17:49
  • @skandigraun I think you gave me enough clues. If you create an answer I will be able to upvote it and make sure you get rewarded for your help. Thanks again – Angelo Sep 21 '18 at 18:02
  • Oh yeah... sweet internet points, here I come :D – skandigraun Sep 21 '18 at 18:06

2 Answers

2

If you start the original Chrome with the `--remote-debugging-port=PORT_NR` argument and visit localhost:PORT_NR from another browser, you will have access to the full content of the browser, including the dev console.
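A minimal sketch of launching Chrome this way from Python, assuming the Windows install path quoted in the comments and port 9222. The helper name `build_chrome_command` and its optional `user_data_dir` parameter are hypothetical, introduced here only for illustration. The key point from the comments is that the executable path and the flag must be separate list elements.

```python
import subprocess


def build_chrome_command(chrome_path, port, user_data_dir=None):
    """Return the argument list for starting Chrome with remote debugging.

    The executable path and each flag are separate list elements; putting
    them in one string makes Windows look for a file with that whole name.
    """
    command = [chrome_path, f"--remote-debugging-port={port}"]
    if user_data_dir is not None:
        # Optional: run with a separate profile directory (an assumption,
        # not something the original answer requires).
        command.append(f"--user-data-dir={user_data_dir}")
    return command


if __name__ == "__main__":
    # Path taken from the question's comments; adjust for your machine.
    chrome = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
    # Popen returns immediately, so the script can continue (for example
    # with pyautogui) while Chrome stays open.
    process = subprocess.Popen(build_chrome_command(chrome, 9222))
    print("Chrome started with PID", process.pid)
```

`subprocess.Popen` is used instead of `subprocess.call` here because `call` waits for the process to finish, which may block the rest of the script while Chrome is running.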

Once you have this, you have multiple ways to go:

  1. You can visit http://localhost:PORT_NR with any other browser (or even with the same browser), and you should have full access to the content of the original Chrome. With Selenium you should have a relatively easy time of it.

  2. You can also use the DevTools API (the documentation... well, there is room for improvement. Search for `chrome devtools protocol` to be amazed by the lack of docs). As an example, you can open http://localhost:PORT_NR/json to get the available debugging URIs. Grab the relevant websocket endpoint (`webSocketDebuggerUrl`), open a websocket connection, and issue a command like `{"method": "DOM.getDocument", "id":12}`. You can find the available DOM-related commands here: https://chromedevtools.github.io/devtools-protocol/1-3/DOM
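For the Selenium route, one possible sketch is to attach ChromeDriver to the already running browser through the `debuggerAddress` option instead of visiting the port as a page. This is an alternative to what the answer literally describes. It assumes the third-party `selenium` package and a ChromeDriver matching the installed Chrome. The helper names are hypothetical. Whether a site's bot detection still reacts to the attached driver is not something this sketch can promise.

```python
def debugger_address(port, host="127.0.0.1"):
    """Format the host:port string expected by the debuggerAddress option."""
    return f"{host}:{port}"


def attach_to_running_chrome(port):
    """Attach Selenium to a Chrome started with --remote-debugging-port."""
    # Third-party dependency, imported lazily: pip install selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Tells ChromeDriver to connect to an existing browser rather than
    # launching a new one.
    options.add_experimental_option("debuggerAddress", debugger_address(port))
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    driver = attach_to_running_chrome(9222)
    print("Attached to tab:", driver.title)
    html = driver.page_source  # current DOM serialized as HTML
    print(len(html), "characters of HTML")
```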

skandigraun
0

Since I had to reinvent the wheel, here is some extra info that I couldn't find anywhere:

  1. Start the browser with remote debugging enabled (see the previous posts).
  2. Connect to the given port on localhost and use these HTTP GET requests to get very limited control over your browser: https://chromedevtools.github.io/devtools-protocol/#endpoints

Most important:

  • GET /json/new?{url} (recent Chrome versions expect PUT for this endpoint)
  • GET /json/activate/{targetId}
  • GET /json/close/{targetId}
  • GET /json or /json/list
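The endpoints above can be called with nothing but the standard library. The following is a rough sketch, assuming Chrome is listening on localhost port 9222. The function names are hypothetical. `/json/list` returns JSON, while the activate and close endpoints reply with a short text message, so they are read as plain text here.

```python
import json
import urllib.request


def endpoint_url(port, path, host="localhost"):
    """Build the URL of a DevTools HTTP endpoint such as /json/list."""
    return f"http://{host}:{port}{path}"


def list_targets(port):
    """GET /json/list and return the parsed list of targets."""
    url = endpoint_url(port, "/json/list")
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def target_action(port, action, target_id):
    """GET /json/activate/{targetId} or /json/close/{targetId}.

    Returns the reply body as text.
    """
    if action not in ("activate", "close"):
        raise ValueError(f"unsupported action: {action}")
    url = endpoint_url(port, f"/json/{action}/{target_id}")
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a Chrome instance started with --remote-debugging-port=9222.
    for target in list_targets(9222):
        print(target.get("type"), target.get("id"), target.get("url"))
```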

To gain full control over the browser, you need a WebSocket connection. Each object returned by GET /json or /json/list has its own ID. Use this ID to interact with the tab. By the way, objects of type "page" are normal tabs; the others are extensions and so on. Once you know which tab you want to control, get its "webSocketDebuggerUrl".

Use this URL to connect with something that can speak the WebSocket protocol.
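As a sketch of this step, the snippet below filters the target list down to normal tabs and picks the `webSocketDebuggerUrl` of the first tab whose URL contains a given fragment. The helper names are hypothetical, and the connection uses the third-party `websocket-client` package (one option among several). Passing `suppress_origin=True` is an assumption: it skips the Origin header, which some recent Chrome versions may otherwise reject.

```python
import json
import sys
import urllib.request


def page_targets(targets):
    """Keep only normal tabs (type "page"), dropping extensions and the like."""
    return [target for target in targets if target.get("type") == "page"]


def find_websocket_url(targets, url_fragment):
    """Return the webSocketDebuggerUrl of the first tab matching url_fragment."""
    for target in page_targets(targets):
        if url_fragment in target.get("url", ""):
            # The key may be absent, in which case None is returned.
            return target.get("webSocketDebuggerUrl")
    return None


def connect(ws_url):
    """Open a WebSocket connection to a tab."""
    # Third-party dependency, imported lazily: pip install websocket-client
    import websocket

    return websocket.create_connection(ws_url, suppress_origin=True)


if __name__ == "__main__":
    fragment = sys.argv[1] if len(sys.argv) > 1 else ""
    with urllib.request.urlopen("http://localhost:9222/json/list", timeout=5) as response:
        targets = json.loads(response.read().decode("utf-8"))
    ws_url = find_websocket_url(targets, fragment)
    if ws_url is None:
        print("No matching page tab found")
    else:
        ws = connect(ws_url)
        print("Connected to", ws_url)
        ws.close()
```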

Once connected, you must craft valid JSON with the following structure:

{
"id": 0,
"method": "Page.navigate",
"params": {"url": "http://google.com"}
}

Notes:

  • "id" is a simple counter (int) that increases with each message; it is not the ID of the tab(!)
  • "method" is the method name as described in the docs.
  • "params" are the method's parameters, also described in the docs.
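One way to keep the message counter straight is a small builder that hands out increasing ids and serializes each command with `json.dumps`, so the result is always valid JSON. `CommandBuilder` is a hypothetical helper, not part of any library.

```python
import itertools
import json


class CommandBuilder:
    """Create DevTools protocol messages with an increasing integer id."""

    def __init__(self, start=0):
        self._ids = itertools.count(start)

    def build(self, method, params=None):
        """Return the JSON text for one command."""
        message = {"id": next(self._ids), "method": method}
        if params:
            message["params"] = params
        return json.dumps(message)


if __name__ == "__main__":
    builder = CommandBuilder()
    # Each call uses the next id; the id is unrelated to the tab's ID.
    print(builder.build("Page.navigate", {"url": "http://google.com"}))
    print(builder.build("DOM.getDocument"))
```

The resulting strings can be sent as-is over the WebSocket connection to the tab.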

The return values are always JSON.
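A sketch of handling those JSON replies and using them to fetch the page's HTML, which is what the question asked for. A reply to a command carries the same "id" plus either "result" or "error"; messages without an "id" are events pushed by the browser. The helper names are hypothetical, `websocket-client` is an assumed third-party dependency, and the WebSocket URL is expected as a command-line argument.

```python
import json
import sys


def classify_message(raw):
    """Parse one incoming message and label it result, error or event."""
    message = json.loads(raw)
    if "id" in message:
        kind = "error" if "error" in message else "result"
    else:
        kind = "event"  # unsolicited notification from the browser
    return kind, message


def call(ws, command_id, method, params=None):
    """Send one command and wait for the reply carrying the same id."""
    ws.send(json.dumps({"id": command_id, "method": method, "params": params or {}}))
    while True:
        kind, message = classify_message(ws.recv())
        if kind == "event" or message["id"] != command_id:
            continue  # skip events and replies to other commands
        if kind == "error":
            raise RuntimeError(message["error"].get("message", "unknown error"))
        return message["result"]


def fetch_outer_html(ws_url):
    """Return the current DOM of a tab as HTML via DOM.getOuterHTML."""
    # Third-party dependency, imported lazily: pip install websocket-client
    import websocket

    ws = websocket.create_connection(ws_url, suppress_origin=True)
    try:
        root = call(ws, 1, "DOM.getDocument")["root"]
        result = call(ws, 2, "DOM.getOuterHTML", {"nodeId": root["nodeId"]})
        return result["outerHTML"]
    finally:
        ws.close()


if __name__ == "__main__":
    # Usage: python script.py <webSocketDebuggerUrl of the tab>
    print(fetch_outer_html(sys.argv[1])[:500])
```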

From now on you can use the official docs: https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-navigate

Dunno how other people found out about it, but it took me a few hours to get it working. Probably because everyone is just using Python's Selenium to do it.