I am trying to use PowerShell to parse and scrape a web page (a software inventory management site we have) with the code below:

$test = Invoke-WebRequest -UseBasicParsing -Uri https://testuser -UseDefaultCredentials
$test.ToString() -split "[`r`n]" |
    Select-String "Usbser.sys" |
    ConvertFrom-StringData

It works, except that I also need the version of the software I am searching for.

Current output:

Name                           Value                                                                                                                                                                                                
----                           -----                                                                                                                                                                                                
<td class                      "info">Usbser.sys </td

$test.ToString() gives me the data below:

<td class="info">Usbser.sys </td>
        <td class="info">10.0.16299.334</td>

How do I strip those tags from the current output and also have it display the version info, 10.0.16299.334?

Edit1: I managed to find the class, as suggested by Lieven:

className                    : 
id                           : installedSoftwareContainer
tagName                      : DIV
parentElement                : System.__ComObject
style                        : System.__ComObject
onhelp                       : 
onclick                      : 
ondblclick                   : 
onkeydown                    : 
onkeyup                      : 
onkeypress                   : 
onmouseout                   : 
onmouseover                  : 
onmousemove                  : 
onmousedown                  : 
onmouseup                    : 
document                     : mshtml.HTMLDocumentClass
title                        : 
language                     : 
onselectstart                : 
sourceIndex                  : 1496
recordNumber                 : 
lang                         : 
offsetLeft                   : 0
offsetTop                    : 0
offsetWidth                  : 0
offsetHeight                 : 0
offsetParent                 : 
innerHTML                    : 
                                   <table width="100%" border="1" cellspacing="0" cellpadding="5">
                                     <tbody><tr>
                                       <td class="caption">Name</td>
                                       <td class="caption">Version</td>
                                     </tr>
                                     <tr>
                                       <td class="info">1E NomadBranch x64</td>
                                       <td class="info">6.3.201</td>
    InnerText : 1E NomadBranch x646.3.201

but when I try the below code, I get nothing

$test = Invoke-WebRequest  -Uri https://testurl.com -UseDefaultCredentials 

$test.ParsedHtml.getElementbyid('installedsoftwarecontainer') | select innertext

What am I doing wrong?

Oxycash
  • 1
    `Select-String` allows you to select a number of lines preceding/following a match using the `-Context` parameter, but I would try using the HTML parser for this. Something like `(Invoke-WebRequest https://testuser).ParsedHtml.getElementsByClassName('info')` might get you started. – Lieven Keersmaekers Jul 30 '19 at 06:09
  • 3
    As always, [do not parse HTML with regex](http://stackoverflow.com/a/1732454/1630171). – Ansgar Wiechers Jul 30 '19 at 07:28
  • *"Even Jon Skeet cannot parse HTML using regular expressions"* – Lieven Keersmaekers Jul 30 '19 at 07:49
  • @LievenKeersmaekers I tried, but it won't work because it's an internal site (badly developed, I think); it keeps returning no classes and only 4 tags: html, head, title, body. – Oxycash Jul 30 '19 at 08:57
  • 1
    The following returns all tags/classes/ids from your HTML. If that doesn't give you anything to go on, you'll have to post an HTML example. `ParsedHtml.getElementsByTagName('*') | Group-Object -Property tagName, ClassName, Id | foreach { $b = $_.name -split ', '; [pscustomobject] @{ tagName = $b[0]; ClassName = $b[1]; Id = $b[2]; 'Count' = ($_.Group.Count) } } | Sort-Object tagName, ClassName, Id, Count` – Lieven Keersmaekers Jul 30 '19 at 09:03
  • 1
    @LievenKeersmaekers that finally worked!!! tagname=div, classname = installedsoftware, id = null, count = 1. Earlier, I kept trying with getelementbyID('installedsoftware') which returned a big empty space. – Oxycash Jul 30 '19 at 09:38
  • @JsJ - gtk - You can post (and accept) your final solution as an answer so others might benefit from it. – Lieven Keersmaekers Jul 30 '19 at 09:48
  • @LievenKeersmaekers Sure, I will. but I am still failing to get it to work. Not sure what I am doing wrong. check the edit1. – Oxycash Jul 30 '19 at 10:11
  • 1
    Use the [exact case](https://stackoverflow.com/a/1236864/52598): installedSoftwareContainer – Lieven Keersmaekers Jul 30 '19 at 10:21
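
The `-Context` suggestion from the first comment can be sketched as follows. This is only a stopgap that assumes the version always sits on the line directly after the name, as in the two sample lines from the question; the tag-stripping `-replace` is a crude regex and not a substitute for the HTML parser. The `$html` sample and the `$result` variable are made up for illustration.

```powershell
# Sample lines modeled on the $test.ToString() output shown in the question.
$html = @'
<td class="info">Usbser.sys </td>
        <td class="info">10.0.16299.334</td>
'@

# Split into lines, tolerating both Windows and Unix line endings.
$lines = $html -split '\r?\n'

# -Context 0,1 keeps zero lines before and one line after each match.
$result = $lines |
    Select-String -Pattern 'Usbser.sys' -SimpleMatch -Context 0,1 |
    ForEach-Object {
        [pscustomobject]@{
            # Strip anything that looks like a tag, then trim whitespace.
            Name    = ($_.Line -replace '<[^>]+>').Trim()
            Version = ($_.Context.PostContext[0] -replace '<[^>]+>').Trim()
        }
    }

$result
```

If the page layout ever puts the version on a different line, this breaks silently, which is one reason the parser route is preferable.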

1 Answer

After a day of struggle, I found the question below and applied the same approach in my code: split the innerText into lines, then filter further.

powershell -split('') specify a new line

My Code:

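    # Fetch the page with the current Windows credentials
    $test = Invoke-WebRequest -Uri 'https://testURL.com' -UseDefaultCredentials

    # Grab the container element and take its text content
    $data = $test.ParsedHtml.IHTMLDocument3_getElementById('installedSWContainer') | Select-Object -ExpandProperty innerText

    # Split the text into lines and keep those that mention the product
    $out = $data.Split([Environment]::NewLine) | Select-String -Pattern "citrix"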
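
One caveat with filtering on innerText: as the Edit1 dump in the question shows, the name and version cells run together (`1E NomadBranch x646.3.201`), so the version cannot be separated reliably afterwards. A possible alternative is to walk the table rows instead. The sketch below is not the code I actually ran; it assumes the container's innerHTML is well-formed enough to be cast to `[xml]` (real pages often are not), and the `$innerHtml` sample and `$software` variable are made up for illustration.

```powershell
# Sample fragment modeled on the innerHTML shown in the question.
# Assumption: the real fragment is well-formed enough for an [xml] cast.
$innerHtml = @'
<table width="100%" border="1" cellspacing="0" cellpadding="5">
  <tbody>
    <tr><td class="caption">Name</td><td class="caption">Version</td></tr>
    <tr><td class="info">1E NomadBranch x64</td><td class="info">6.3.201</td></tr>
    <tr><td class="info">Usbser.sys </td><td class="info">10.0.16299.334</td></tr>
  </tbody>
</table>
'@

$table = [xml]$innerHtml

# Build one object per data row; the caption row has no "info" cells and is skipped.
$software = foreach ($row in $table.SelectNodes('//tr')) {
    $cells = @($row.SelectNodes('td[@class="info"]'))
    if ($cells.Count -ge 2) {
        [pscustomobject]@{
            Name    = $cells[0].InnerText.Trim()
            Version = $cells[1].InnerText.Trim()
        }
    }
}

# Filter by name; the version stays attached to its row.
$software | Where-Object Name -like '*Usbser*'
```

Against the live page, `$innerHtml` would presumably come from the `innerHTML` property of the element returned by `IHTMLDocument3_getElementById`, but I have not verified that the markup there survives the `[xml]` cast.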
Oxycash