In this HTML code:

<div id="ajaxWarningRegion" class="infoFont"></div>
  <span id="ajaxStatusRegion"></span>
  <form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
    <pre>
      Creating a new ZIP of IP Phone files from HTTP/PhoneBackup 
      and HTTPS/PhoneBackup
    </pre>
    <pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
    <pre>Reports Success</pre>
    <pre></pre>
    <a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
      Download the new ZIP of IP Phone files
    </a>
  </div>

I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.

How can I do that?

Ocaso Protal
Littlefish
  • [Regular expressions are the wrong approach to parsing HTML (or XML)](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Richard Jul 25 '12 at 14:15
  • Richard, I'd disagree in this case. What they want to extract actually has nothing to do with HTML, XML or any other non-regular language. It's just a string from which they want to extract a date. – Joey Jul 26 '12 at 05:10

4 Answers

What makes this question so interesting is that HTML looks and smells just like XML, the latter being much more programmatically palatable due to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:

Select-NodeContent $doc.DocumentNode "//a/@href"

And this one extracts the desired substring:

Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"

The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:

  • Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
  • Install PowerShell Community Extensions if you want to parse a live web page.
  • Understand XPath to be able to construct a navigable path to your target node.
  • Understand regular expressions to be able to extract a substring from your target node.

With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML from a file or from the web, depending on your needs.

Set-StrictMode -Version Latest
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

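# Note: the ?: used below is the PSCX alias for Invoke-Ternary,
# so PSCX must be imported for this function to run as written.
function Select-NodeContent(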
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.Load("tmp\temp.html") # Use this to load a file (Load returns nothing, so there is no result to assign)
#$doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
Michael Sorens

Actually, the HTML surrounding your file name is irrelevant here. You can extract the date just fine with the following regex (which doesn't even care whether you're extracting it from an e-mail, an HTML page or a CSV file):

(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)

Quick test:

PS> [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')

Groups   : {2012-Jul-25_15:47:47}
Success  : True
Captures : {2012-Jul-25_15:47:47}
Index    : 391
Length   : 20
Value    : 2012-Jul-25_15:47:47
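
To get just the string rather than the whole Match object, read its Value property, or use the -match operator, which fills the automatic $Matches variable. A minimal sketch, assuming the page source is held in $html (a shortened stand-in string is used here, and the variable names $stamp and $stampFromOperator are only illustrative):

```powershell
# Shortened stand-in for the page source; in practice $html holds the full HTML.
$html = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'

$pattern = '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)'

# Option 1: the static Match method returns a Match object; Value is the matched text.
$m = [regex]::Match($html, $pattern)
$stamp = if ($m.Success) { $m.Value } else { $null }

# Option 2: -match returns $true or $false and stores the whole match in $Matches[0].
if ($html -match $pattern) { $stampFromOperator = $Matches[0] }
```

Because the lookbehind and lookahead are not part of the match, both variables hold only the text between the prefix and the extension.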
Joey

Groups 2 and 3 of the following regex contain the date and the time, respectively (group 1 holds both):

IP_PHONE_BACKUP-((.*)_(.*))\.zip

This related question shows how to extract the values from a regex in PowerShell:

Is there a shorter way to pull groups out of a PowerShell regex?
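
As a rough sketch of reading those groups directly, the -match operator puts each capture group into the automatic $Matches variable. This assumes the path has already been isolated in a variable ($path is a hypothetical name), escapes the dot before zip, and swaps the greedy .* for narrower character classes so the match cannot run past the underscore or the extension:

```powershell
# Hypothetical input: just the path taken from the href attribute.
$path = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

# Group 1 = date and time together, group 2 = date, group 3 = time.
if ($path -match 'IP_PHONE_BACKUP-(([^_]+)_([^.]+))\.zip') {
    $stamp = $Matches[1]
    $date  = $Matches[2]
    $time  = $Matches[3]
}
```

$Matches[0] would hold the entire matched text, including the prefix and the extension.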

HIH

poussma

Without regex:

$a = '<div id="ajaxWarningRegion" class="infoFont"></div><span id="ajaxStatusRegion"></span><form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" ><pre>Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup</pre><pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre><pre>Reports Success</pre><pre></pre><a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download the new ZIP of IP Phone files</a></div>'
$a.Substring($a.IndexOf("IP_PHONE_BACKUP")+"IP_PHONE_BACKUP".length+1, $a.IndexOf(".zip")-$a.IndexOf("IP_PHONE_BACKUP")-"IP_PHONE_BACKUP".length-1)

Substring gets you a part of the original string. The first parameter is the start position of the substring, while the second parameter is the length of the desired substring. So all you have to do is calculate the start and the length using a little IndexOf and Length magic.
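
The same calculation may be easier to follow with the intermediate values named. A minimal sketch on a shortened sample string; the variable names are illustrative, and it assumes both the prefix and the extension are present (IndexOf returns -1 when they are not):

```powershell
# Shortened stand-in for the HTML held in $a above.
$text   = '<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>'
$prefix = 'IP_PHONE_BACKUP-'
$suffix = '.zip'

# Start right after the prefix.
$start = $text.IndexOf($prefix) + $prefix.Length

# Look for the extension only from that point on, so an earlier ".zip" cannot interfere.
$end = $text.IndexOf($suffix, $start)

# The length is the distance between the two positions.
$stamp = $text.Substring($start, $end - $start)
```

Including the trailing hyphen in the prefix removes the need for the +1 and -1 adjustments.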

Ocaso Protal