I need to get the HTML, with its current styles (maybe inlined), of a page that has finished rendering and finished running its scripts. This has to be done by a server-side application that is given just a URL (no extra information such as cookies, no POSTs, no forms in the way, etc.).
A bridge/proxy to a temporarily running browser, or a standalone utility built on a browser library, is an acceptable solution. However, the chosen browser or browser library must be available on all major platforms and must be able to run without an OS GUI being present or installed.
An optional requirement is to remove all scripts afterwards. There are already standalone solutions for this; I mention it here because the given answer might be able to remove scripts while rendering, or something like that.
How do I get a snapshot of the current HTML document as HTML+CSS in a single .html file, with the current styles (maybe inlined) and the current images (as data URIs)?
If it can be done in pure PHP, that would be a plus (although I doubt it; I haven't found anything interesting).
Edit: I know how to load HTTP resources and get the HTML for a URL; that's not what I'm looking for ;)
Edit 2: Example input HTML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<link rel="stylesheet" type="text/css" href="/css/example.css">
<script type="text/javascript" src="/javascript/example.js"></script>
<script type="text/javascript">
window.addEventListener("load",
function(event){
document.title="New title";
document.getElementById("pic_0").style.border="0px";
}
);
</script>
<style type="text/css">
p{
color: blue;
}
</style>
</head>
<body>
<p>Hello world!</p>
<p>
<img
alt=""
style="border: 1px"
id="pic_0"
src="http://linuxgazette.net/144/misc/john/helloworld.png"
>
</p>
</body>
</html>
Example output:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>New title</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<style type="text/css">
b{font-weight: bold}
</style>
<style type="text/css">
p{
color: blue;
}
</style>
</head>
<body>
<p>Hello world!</p>
<p>
<img
alt=""
style="border: 0px"
id="pic_0"
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACgAAAAoBAMAAAB+0KVeAAAAK3RFWHRDcmVhdGlvbiBUaW1lAFYgMzEgYXVnLiAyMDEyIDE3OjU4OjU1ICswMjAwWMdbPwAAAAd0SU1FB9wIHw8ABeoUyU4AAAAJcEhZcwAACxIAAAsSAdLdfvwAAAAEZ0FNQQAAsY8L/GEFAAAABlBMVEX///8AAABVwtN+AAAAXklEQVR42uWQUQ6AMAhD6Q3a+19WqsawwMf+NLEfy3iDlC7idTGQp/YglFAsUMqSwjlQOhN3mIMTHDq70SeEWBbt0EG8POWkDySvmCh/SssvNfwIfb+hFmgjFKPf6gDQBAQ368m09AAAAABJRU5ErkJggg=="
>
</p>
</body>
</html>
Notice how the <title> tag changed, how border: 1px became border: 0px, and how the image URL was transformed into a data URI.
For example, some of these transformations (the inline CSS and the <title> tag) can be observed when inspecting the document with the Google Chrome inspector.
Edit 3: Replacing external resources (styles and images) with on-page ones and removing JavaScript is the easy part. The hard part is computing the CSS styles after the JavaScript has run.
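To illustrate what I mean by the easy part: once the image bytes have been downloaded, turning them into a data URI is close to a one-liner. A minimal sketch, assuming Node.js (for `Buffer`) purely for illustration; `toDataUri` is a name I made up, and fetching the bytes is left out.

```javascript
// Hypothetical helper: build a data URI from already-downloaded bytes.
// Assumes Node.js, where Buffer provides the base64 encoding.
function toDataUri(mimeType, bytes) {
  return "data:" + mimeType + ";base64," + Buffer.from(bytes).toString("base64");
}

// Three arbitrary bytes standing in for real image data.
console.log(toDataUri("image/png", [1, 2, 3])); // prints data:image/png;base64,AQID
```

The result would replace the `src` attribute of the `<img>` tag, as in the example output above.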
Edit 4: Maybe this could be done using injected JavaScript (browser control would still be needed, though)?
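A rough sketch of the kind of injected script I have in mind, not a tested solution. It assumes it runs inside the page after the load event has fired, and relies on the standard `getComputedStyle`, `querySelectorAll`, `setAttribute` and `outerHTML` DOM APIs. The function name `snapshotWithInlineStyles` and the `win` parameter are my own invention; passing the window in just makes the function easier to try out with a stand-in object.

```javascript
// Sketch only: intended to be injected into the page after its load event.
// `win` is the page's window object (a parameter by my own choice, so the
// function can also be exercised with a stand-in object outside a browser).
function snapshotWithInlineStyles(win) {
  const doc = win.document;

  // Optional requirement: drop all scripts before serializing.
  for (const script of Array.from(doc.querySelectorAll("script"))) {
    script.remove();
  }

  // Pass 1: read every element's computed style before changing anything,
  // so that writing inline styles cannot influence later reads.
  const pending = [];
  for (const el of Array.from(doc.querySelectorAll("*"))) {
    const computed = win.getComputedStyle(el);
    const declarations = [];
    for (let i = 0; i < computed.length; i++) {
      const name = computed[i]; // property name at position i
      declarations.push(name + ": " + computed.getPropertyValue(name));
    }
    pending.push([el, declarations.join("; ")]);
  }

  // Pass 2: write the collected declarations into each style attribute.
  for (const [el, cssText] of pending) {
    el.setAttribute("style", cssText);
  }

  // outerHTML does not include the doctype; it would have to be re-added.
  return doc.documentElement.outerHTML;
}
```

In a browser this would be called as `snapshotWithInlineStyles(window)`. It writes every computed property onto every element, so the result would be far more verbose than the example output above, and it does nothing about images or linked stylesheets. Something still has to load the page, inject the script and collect the returned string, which is the browser-control part I'm asking about.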