I have this PowerShell script. It logs in to a site and then navigates to another page.

I want to save the whole source of that page, but for some reason some parts of the source code are not coming across.

$username = "myuser" 
$password = "mypass"
$ie = New-Object -com InternetExplorer.Application
$ie.visible=$true
$ie.navigate("http://www.example.com/login.shtml")
while($ie.ReadyState -ne 4) {start-sleep -m 100}
$ie.document.getElementById("username").value = "$username"
$ie.document.getElementById("pass").value = "$password"
$ie.document.getElementById("frmLogin").submit()
start-sleep 5
$ie.navigate("http://www.example.com/thislink.shtml")
$ie.Document.body.outerHTML | Out-File -FilePath c:\sourcecode.txt


Here is a pastebin of the code that is not coming across:
http://pastebin.com/Kcnht6Ry

user206168

2 Answers

3

After you navigate, check the ready state again instead of relying on a fixed sleep. The same loop you already use will work.

From running the code, it appears the sleep may not be long enough if the site is slow to load.

while($ie.ReadyState -ne 4) {start-sleep -m 100}
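Applied to the script from the question, that could look like the sketch below. `Wait-IEReady` is a hypothetical helper name, not something built in, and the 30-second timeout and the 1-second pause after the form submission are assumptions chosen for illustration. The URLs and element IDs are the ones from the question.

```powershell
# Hypothetical helper: polls until IE reports the page as complete, or gives up.
function Wait-IEReady {
    param(
        [Parameter(Mandatory=$true)] $Browser,
        [int] $TimeoutSeconds = 30   # assumed default, adjust for slow sites
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    # ReadyState 4 is READYSTATE_COMPLETE; Busy is true while a navigation is in progress.
    while ($Browser.Busy -or $Browser.ReadyState -ne 4) {
        if ((Get-Date) -gt $deadline) {
            throw "Page did not finish loading within $TimeoutSeconds seconds."
        }
        Start-Sleep -Milliseconds 100
    }
}

$username = "myuser"
$password = "mypass"

$ie = New-Object -ComObject InternetExplorer.Application
$ie.Visible = $true

$ie.Navigate("http://www.example.com/login.shtml")
Wait-IEReady -Browser $ie

$ie.Document.getElementById("username").value = $username
$ie.Document.getElementById("pass").value = $password
$ie.Document.getElementById("frmLogin").submit()
Start-Sleep -Seconds 1      # assumed pause so the form submission has time to start
Wait-IEReady -Browser $ie

$ie.Navigate("http://www.example.com/thislink.shtml")
Wait-IEReady -Browser $ie   # the wait that was missing before reading the document

$ie.Document.body.outerHTML | Out-File -FilePath C:\sourcecode.txt
```

The timeout keeps the script from looping forever if the page never reaches the complete state.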

There is also another post about this, "innerHTML converts CDATA to comments". Someone on that page wrote a function that cleans up the output. Once you have that function declared in your code, the call would look something like this:

htmlWithCDATASectionsToHtmlWithout($ie.Document.body.outerHTML) | Out-File -FilePath c:\sourcecode.txt
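Note that the line above uses the name of a JavaScript function, and JavaScript source cannot be pasted into a .ps1 file as-is. As a rough PowerShell-only alternative, here is a minimal sketch. It is not a port of the linked function. It assumes the CDATA sections come back from IE serialized as comments of the form `<!--[CDATA[ ... ]]-->` and rewrites them to `<![CDATA[ ... ]]>`. `Restore-CDataSections` is a hypothetical name, and whether this recovers the missing content depends on how the page is actually serialized.

```powershell
# Hypothetical helper: turns <!--[CDATA[ ... ]]--> back into <![CDATA[ ... ]]>.
# Assumes that comment form is how the CDATA sections appear in the serialized HTML.
function Restore-CDataSections {
    param([Parameter(Mandatory=$true)][string] $Html)
    # (?s) lets "." match newlines, so multi-line sections are handled too.
    # The replacement is single-quoted so $1 reaches the regex engine untouched.
    $Html -replace '(?s)<!--\[CDATA\[(.*?)\]\]-->', '<![CDATA[$1]]>'
}
```

It would be called the same way, for example `Restore-CDataSections $ie.Document.body.outerHTML | Out-File -FilePath c:\sourcecode.txt`.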
tkrn
  • Sorry, but the page loads fully; I have visible turned on. The problem is that it ignores the code after `//<![CDATA[` – user206168 Jun 11 '13 at 18:36
  • Thanks a lot, but I am still getting an error when using the function you posted. `At C:\Users\mmmm\Desktop\new.ps1:4 char:5 + var ATTRS = "(?:[^>\"\]|\"[^\"]*\"|\'[^\']*\')*",` – user206168 Jun 11 '13 at 19:13
  • Marking as solved, but I still need to work on fixing the error from that code. – user206168 Jun 17 '13 at 14:10
0

I agree with @tkrn about using the while loop to wait for the IE document to be ready. I recommend sleeping for at least 2 seconds inside the loop.

while($ie.ReadyState -ne 4) {start-sleep -s 2}

I also found an easier way to get the whole HTML source of the page, exactly as served from the URL. Here it is:

$ie.Document.parentWindow.execScript("var JSIEVariable = new XMLSerializer().serializeToString(document);", "javascript")
$obj = $ie.Document.parentWindow.GetType().InvokeMember("JSIEVariable", 4096, $null, $ie.Document.parentWindow, $null)
$HTMLDoc = $obj.ToString()

Now $HTMLDoc holds the whole HTML source of the page intact, and you can save it as an HTML file.
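For anyone wondering what the three lines do, here is the same code with comments, as a sketch. It assumes `$ie` is the InternetExplorer.Application object from the question's script and that the page has finished loading. `XMLSerializer` is assumed to be available in the page (that is, a reasonably recent IE document mode). The output path is a placeholder.

```powershell
# Assumes $ie is the browser object created earlier, with the page fully loaded.
$window = $ie.Document.parentWindow

# 1. Run JavaScript inside the page. A top-level "var" becomes a property of the
#    window object, so the serialized markup ends up in window.JSIEVariable.
$window.execScript("var JSIEVariable = new XMLSerializer().serializeToString(document);", "javascript")

# 2. Read that property back through .NET reflection. The literal 4096 in the
#    original is the numeric value of BindingFlags.GetProperty.
$flags = [System.Reflection.BindingFlags]::GetProperty
$obj = $window.GetType().InvokeMember("JSIEVariable", $flags, $null, $window, $null)

# 3. Convert the result to a .NET string and save it (placeholder path).
$HTMLDoc = $obj.ToString()
$HTMLDoc | Out-File -FilePath C:\sourcecode.html
```

`JSIEVariable` is just an arbitrary variable name. The reflection call is presumably needed because the variable is created at run time, so it is not visible to PowerShell as an ordinary property of the window object.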

  • Do you have any explanation for that "JSIEVariable" stuff? It works but I want to know why since I don't understand at all what's happening here. – Rakha May 17 '19 at 13:41