15

I'm trying to write a PowerShell script to get the text of all elements with the class "newstitle" from a website.

This is what I have:

function check-krpano {
    $geturl=Invoke-WebRequest http://krpano.com/news/
    $news=$geturl.parsedhtml.body.GetElementsByClassName("newstitle")[0]
    Write-Host  "$news"
}

check-krpano

It obviously needs much more tweaking, but so far, it doesn't work.

I managed to write a script using GetElementById, but I don't know the syntax for GetElementsByClassName, and to be honest, I haven't been able to find much information about it.

NOTE:

I've ticked the right answer to my question, but that's not the solution I chose to use in my script.

Although I was able to find the content within a tag containing a certain class using two methods, they were much slower than searching for links.

Here is the output using Measure-Command:

  • Search for divs containing class 'newstitle' using parsedhtml.body -> 29.6 seconds
  • Search for divs containing class 'newstitle' using AllElements -> 10.4 seconds
  • Search for links whose 'href' attribute contains #news -> 2.4 seconds

So I have marked the answer that uses the Links method as useful.
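For anyone who wants to reproduce a comparison like this, here is a minimal sketch of timing several approaches with Measure-Command. The helper name Compare-Approach and the sleep-based workloads are placeholders of mine, not part of my actual script; to measure the real thing, swap in script blocks that run the three lookups against $geturl.

```powershell
# Hypothetical helper (not from the original script): runs each named
# script block once under Measure-Command and reports the elapsed seconds.
function Compare-Approach {
    param(
        [Parameter(Mandatory)]
        [System.Collections.IDictionary]$Approaches
    )

    foreach ($name in $Approaches.Keys) {
        $block = $Approaches[$name]
        # The block's output is discarded; only the elapsed time matters.
        $elapsed = Measure-Command { & $block | Out-Null }
        [pscustomobject]@{
            Approach = $name
            Seconds  = [math]::Round($elapsed.TotalSeconds, 3)
        }
    }
}

# Placeholder workloads so the sketch runs without a network connection.
$results = Compare-Approach ([ordered]@{
    'short sleep'  = { Start-Sleep -Milliseconds 50 }
    'longer sleep' = { Start-Sleep -Milliseconds 200 }
})
$results | Format-Table -AutoSize
```

A single run per approach is noisy, so repeating each measurement a few times would give a fairer picture.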

This is my final script:

function check-krpano {
    Clear-Host
    $geturl=Invoke-WebRequest http://krpano.com/news
    $news = ($geturl.Links |Where href -match '\#news\d+' | where class -NotMatch 'moreinfo+' )
    $news.outertext | Select-Object -First 5
}

check-krpano
RafaelGP
  • Your problem seems to be related to a certain PowerShell version as it works perfectly in PowerShell 5.1 ([see below](https://stackoverflow.com/a/61712844/11942268)). – stackprotector May 10 '20 at 13:32

5 Answers

21

If you figure out how to get GetElementsByClassName to work, I'd like to know. I just ran into this yesterday and ran out of time so I came up with a workaround:

$geturl.ParsedHtml.body.getElementsByTagName('div') | 
    Where {$_.getAttributeNode('class').Value -eq 'newstitle'}
Keith Hill
  • 3
    Looks like a bug in `getElementsByTagName()` to me. However, I just came across [this answer](http://stackoverflow.com/a/9059206/1630171), which suggests something like this: `$geturl.AllElements | ? { $_.Class -eq 'newstitle' } | select innerText`. Might be a little more elegant. – Ansgar Wiechers Jul 13 '13 at 10:03
  • 1
    Good news is that it works with PowerShell v5. I came across this thread after my code broke running under PowerShell v4. – Robin Feb 03 '16 at 15:58
  • Is there a way to store one of the elements you receive back @AnsgarWiechers ? As in, if I get 5 elements back in my select list like you mentioned, and I want to "capture" it into an array how could I do this? – KangarooRIOT Jun 02 '17 at 14:38
20

getElementsByClassName does not return an array directly but instead a proxy to the results via COM. As you have discovered, conversion to an array is not automatic with the [] operator. You can use the list evaluation syntax, @(), to force it to an array first so that you can access individual elements:

@($body.getElementsByClassName("foo"))[0].innerText

As an aside, conversion is performed automatically if you use the object pipeline, e.g.:

$body.getElementsByClassName("foo") | Select-Object -First 1

It is also performed automatically with the foreach construct:

foreach ($element in $body.getElementsByClassName("foo"))
{
    $element.innerText
}
Don Cruickshank
  • Worked, i found it weird that gettype returned a com object. @($table)[1].outerHTML. You saved me a lot of time. – Ernesto Apr 07 '16 at 22:09
3

Cannot, for the life of me, get that method to work either!

Depending upon what you need back in the result though, this might help;

function check-krpano {
    $geturl = Invoke-WebRequest http://krpano.com/news
    $news = ($geturl.Links | where href -match '\#news\d+')[0]
    $news
}

check-krpano

Gives me back:

innerHTML : krpano 1.16.5 released
innerText : krpano 1.16.5 released
outerHTML : <A href="#news1165">krpano 1.16.5 released</A>
outerText : krpano 1.16.5 released
tagName   : A
href      : #news1165

You can use those properties directly of course, so if you only wanted to know the most recently released version of krpano, this would do it:

function check-krpano {
    $geturl = Invoke-WebRequest http://krpano.com/news
    $news = ($geturl.Links | where href -match '\#news\d+')[0]
    $krpano_version = $news.outerText.Split(" ")[1]
    Write-Host $krpano_version
}

check-krpano

would return 1.16.5 at time of writing.
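Splitting on spaces assumes the version is always the second word of the title. A slightly more defensive sketch, assuming the title contains a dotted version number somewhere, pulls it out with a regular expression instead. The function name Get-VersionFromTitle and the sample titles are illustrative only:

```powershell
# Hypothetical helper: extract the first dotted version number from a title.
function Get-VersionFromTitle {
    param([string]$Title)

    # \d+(\.\d+)+ matches digits followed by one or more ".digits" groups,
    # for example 1.16.5 or 1.20
    if ($Title -match '\d+(\.\d+)+') {
        return $Matches[0]
    }
    return $null   # no version-like text found
}

Get-VersionFromTitle 'krpano 1.16.5 released'   # 1.16.5
Get-VersionFromTitle 'krpano 1.20.6'            # 1.20.6
```

In the function above, this would be called with $news.outerText in place of the Split call.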

Hope that achieves what you wanted, albeit in a different manner.

EDIT:

This is possibly a little faster than piping through Select-Object:

function check-krpano {
    $geturl = Invoke-WebRequest http://krpano.com/news
    ($geturl.Links | where href -match '\#news\d+' | where class -notmatch 'moreinfo+')[0..4].outerText
}
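
To see what the two filters and the range index are doing without hitting the network, here is a small sketch with hand-made objects standing in for the entries of $geturl.Links. The sample data is invented for illustration; only the href, class, and outerText property names come from the code above.

```powershell
# Invented sample data standing in for $geturl.Links (illustration only).
$links = @(
    [pscustomobject]@{ href = '#news1165'; 'class' = '';         outerText = 'krpano 1.16.5 released' }
    [pscustomobject]@{ href = '#news1165'; 'class' = 'moreinfo'; outerText = 'more info' }
    [pscustomobject]@{ href = '/forum';    'class' = '';         outerText = 'Forum' }
    [pscustomobject]@{ href = '#news1164'; 'class' = '';         outerText = 'krpano 1.16.4 released' }
    [pscustomobject]@{ href = '#news1163'; 'class' = '';         outerText = 'krpano 1.16.3 released' }
)

# Keep the news anchors and drop the "more info" links. @() guarantees an
# array even when only one link matches.
$news = @($links |
    Where-Object href -match '#news\d+' |
    Where-Object class -notmatch 'moreinfo')

# Index the first two items directly instead of piping to Select-Object.
$titles = $news[0..1].outerText
$titles
```

With this sample data, $titles holds the 1.16.5 and 1.16.4 titles; the "more info" and forum links are filtered out before indexing.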
Graham Gold
  • Thank you very much for your answer. It helped me to achieve what I was looking for! Although your script is not exactly what I asked, it's the fastest way to get the information, and I adapted my script inspired by yours. – RafaelGP Jul 16 '13 at 09:07
  • You're welcome, I know it doesn't use the `getElements..` methods of `ParsedHtml.body` but it is more efficient for your use case. I've edited my post with a modification to your script that may be just a little faster by accessing the first 5 array items directly without piping to select-object. Saved 0.5 - 1 second in my tests. – Graham Gold Jul 16 '13 at 19:15
  • Thanks for your help. Accessing the first 5 array items seems to be a little faster than using Select-Object :-) – RafaelGP Jul 22 '13 at 11:22
1

I realize this is an old question, but I wanted to add an answer for anyone else who might be trying to achieve the same thing by controlling Internet Explorer through its COM object, like this:

$ie = New-Object -com internetexplorer.application
$ie.navigate($url)
while ($ie.Busy -eq $true) { Start-Sleep -Milliseconds 100; }

I normally prefer to use Invoke-WebRequest as the original poster did, but I've found cases where it seemed like I needed a full-fledged IE instance in order to see all of the JavaScript-generated DOM elements even though I would expect parsedhtml.body to include them.

I found that I could do something like this to get a collection of elements by a class name:

$titles = $ie.Document.body.getElementsByClassName('newstitle')
foreach ($storyTitle in $titles) {
     Write-Output $storyTitle.innerText
}

I observed the same really slow performance the original poster noted when using PowerShell to search the DOM, but using PowerShell 3.0 and IE11, Measure-Command shows that my collection of classes is found in a 125 KB HTML document in 280 ms.

terafl0ps
0

It seems to work with PowerShell 5.1:

function check-krpano {
    $geturl = Invoke-WebRequest -Uri "http://krpano.com/news/"
    $news = $geturl.ParsedHtml.body.getElementsByClassName("newstitle")
    Write-Host "$($news[0].innerHTML)"
}

check-krpano

Output:

<A href="#news1206">krpano 1.20.6</A><SPAN class=smallcomment style="FLOAT: right"><A href="https://krpano.co
m/forum/wbb/index.php?page=Thread&amp;postID=81651#post81651"><IMG class=icon16m src="../design/ico-forumlink
.png"> krpano Forum Link</A></SPAN>
stackprotector