Where to find entire HTML content in Chromium source code

Question

I am trying to do the following: once a webpage loads, check whether its URL matches a certain pattern (say www.wikipedia.com/*). If it does, parse the page's HTML, as one can do with BeautifulSoup, and check whether the page has a div with class foo and id boo. Where can I write this code? That is, where can I get access to the URL, what do I need to listen to in order to know that the page has finished loading (after which I can look at the URL and HTML content), and where and how can I parse the HTML?

I went through the code in src/chrome/browser/tab_contents, but I could not find a reasonable place to do all this.

Chromium uses a multi-process architecture, and web pages are rendered in the renderer process. Once a page finishes loading, this method is invoked: https://cs.chromium.org/chromium/src/content/renderer/render_frame_impl.cc?q=RenderFrameImpl::DidFinishLoad()&sq=package:chromium&g=0&l=4578. You might want to debug from there to find out more. — Asesh, Aug 31 '18 at 05:10
Thanks Asesh. Any idea how I can access the HTML of the entire webpage? — SexyBeast, Aug 31 '18 at 10:31
Do you want to access the source code while debugging or via an API? — Asesh, Sep 04 '18 at 04:49
Depending on your needs a userscript for Tampermonkey might be enough: https://openuserjs.org/about/Tampermonkey-for-Chromium — Peter Bagyinszki, Sep 04 '18 at 10:28
Thanks guys. I was looking for a native way, via C++, not through a script or extension. — SexyBeast, Sep 04 '18 at 10:56
@Asesh, I want a hook within the Chromium code, where I know that the page has been loaded, so the entire HTML is available, so that I can look for elements of interest, and if found, do certain things (like make a REST call to a server to store some data). Nothing related to debugging or API. — SexyBeast, Sep 04 '18 at 17:08
Hi, I am looking for the same thing. Did you find a solution? Thanks. — אVי, Dec 17 '18 at 18:30
I have almost the same question for Android Chromium source code. Can anyone provide a link to the parser in the Java Code? Thank you. — Mathe Eliel, Mar 04 '20 at 08:47

score 5 · Accepted Answer · answered Sep 06 '18 at 21:00

Take a look at the following conceptual application layers which represent how Chromium displays web pages:

Image source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit

The different layers are described as:

WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.

Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).

Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.

WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.

Browser: Represents the browser window, it contains multiple WebContentses.

Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).

Since your goal is to access and interpret the HTML content of a web page by element and/or class, you can look to the rendering process which uses Blink:

The renderers use the Blink open-source layout engine for interpreting and laying out HTML.

Blink has a WebDocument class which allows you to access the HTML content and other properties of a web page:

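// Runs in the renderer process, on the main frame's document.
WebDocument document = GetMainFrame()->GetDocument();
// Look up an element by its id attribute (here "example").
WebElement element = document.GetElementById(WebString::FromUTF8("example"));
// The page URL is available as document.Url().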
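
Once the URL and an element's class attribute are available as strings, the two checks from the question (URL pattern and class membership) can be sketched in plain C++. This is a minimal standalone sketch, not Chromium code: it uses no Blink or content types, and the helper names MatchesUrlPattern and HasClassToken are hypothetical. It assumes the pattern has at most one trailing '*' wildcard, as in www.wikipedia.com/*.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true if `url` (scheme optional) matches `pattern`.
// Assumption: the pattern is a literal with at most one trailing '*',
// e.g. "www.wikipedia.com/*".
bool MatchesUrlPattern(std::string url, std::string pattern) {
  // Drop a leading "scheme://" if present.
  const std::string::size_type scheme_end = url.find("://");
  if (scheme_end != std::string::npos) url.erase(0, scheme_end + 3);

  if (!pattern.empty() && pattern.back() == '*') {
    pattern.pop_back();
    // Prefix match against everything before the wildcard.
    return url.compare(0, pattern.size(), pattern) == 0;
  }
  return url == pattern;  // exact match when there is no wildcard
}

// Hypothetical helper: true if the whitespace-separated class attribute
// value contains `wanted` as a whole token ("foo bar" has "foo", not "fo").
bool HasClassToken(const std::string& class_attr, const std::string& wanted) {
  std::istringstream tokens(class_attr);
  std::string token;
  while (tokens >> token) {
    if (token == wanted) return true;
  }
  return false;
}
```

For the question's case, one would call MatchesUrlPattern with the page URL and "www.wikipedia.com/*", look up the element with id boo, and pass its class attribute value to HasClassToken with "foo". Note that a URL without a trailing slash, such as https://www.wikipedia.com, does not match this particular pattern.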

Also remember, renderer processes run in a sandbox. — Asesh, Sep 07 '18 at 07:00

score 2 · Answer 2 · answered Sep 07 '18 at 13:27

The cleanest approach would be the Chrome remote debugging protocol (the Chrome DevTools Protocol).

Use the methods of its DOM domain to get the root node, then walk, search, or query the DOM.

This would make testing simpler as well: you can implement the logic in your favourite scripting language using an existing client library (there are many) and once that works implement it in C++.

If for some reason this has to run in-process within Chromium, the next step is to start a thread that connects to the debugging endpoint and performs the same operations.
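
As a rough illustration of that flow, the sketch below builds the JSON command messages a client would send: Page.enable (so that the Page.loadEventFired event is delivered), DOM.getDocument, DOM.querySelector, and DOM.getOuterHTML. The protocol method and parameter names are those of the DevTools Protocol; the C++ helper names are hypothetical, and the transport (a WebSocket connection to a browser started with --remote-debugging-port) is assumed and not shown. The root nodeId of 1 in the usage note is only a placeholder for the value returned by DOM.getDocument.

```cpp
#include <string>

// Escapes quotes and backslashes for use inside a JSON string literal.
// This sketch does not handle control characters.
std::string JsonEscape(const std::string& text) {
  std::string out;
  for (char c : text) {
    if (c == '"' || c == '\\') out += '\\';
    out += c;
  }
  return out;
}

// Builds one protocol command: {"id":N,"method":"...","params":{...}}.
// `params_json` must already be a JSON object; it is omitted when empty.
std::string Command(int id, const std::string& method,
                    const std::string& params_json = "") {
  std::string msg =
      "{\"id\":" + std::to_string(id) + ",\"method\":\"" + method + "\"";
  if (!params_json.empty()) msg += ",\"params\":" + params_json;
  return msg + "}";
}

// Enables Page domain events, including Page.loadEventFired.
std::string PageEnable(int id) { return Command(id, "Page.enable"); }

// Requests the root DOM node; the reply carries its nodeId.
std::string DomGetDocument(int id) { return Command(id, "DOM.getDocument"); }

// Runs a CSS selector query below the given node.
std::string DomQuerySelector(int id, int node_id, const std::string& selector) {
  return Command(id, "DOM.querySelector",
                 "{\"nodeId\":" + std::to_string(node_id) +
                     ",\"selector\":\"" + JsonEscape(selector) + "\"}");
}

// Requests the serialized HTML of the given node.
std::string DomGetOuterHtml(int id, int node_id) {
  return Command(id, "DOM.getOuterHTML",
                 "{\"nodeId\":" + std::to_string(node_id) + "}");
}
```

A client would send PageEnable(1), wait for Page.loadEventFired, send DomGetDocument(2), and then use the returned root nodeId in DomQuerySelector(3, 1, "div#boo.foo") to look for a div with id boo and class foo, or in DomGetOuterHtml(4, 1) to fetch the whole page's HTML.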

answered Sep 07 '18 at 13:27 by carlsborg

Can you provide some example code and explain exactly where it would fit, so that I can add a listener for page-load completion and then get the element I am after, say a text box with id `foo`? — SexyBeast, Sep 07 '18 at 15:16

score 2 · Answer 3 · answered Sep 07 '18 at 15:32

You need to use a server-side library to parse the contents of a requested HTML page. In Java, for example, there is the jsoup library; other server-side languages may have alternatives. The main problem you could run into is "forbidden access" due to security restrictions, but since you are not trying to access REST services or similar, only to parse plain HTML to find string patterns, this should be easy to do with jsoup. There was a project where similar code was used to fetch web pages and parse the response HTML string.

Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"

See: https://jsoup.org/
