
I am a regex/PowerShell beginner and am struggling to get this working. I am working with some HTML data and need to extract the string between given characters. In the case below, I need to extract the string found between the characters > and <, but only if it contains my search string. I have provided multiple examples here and hope my question is clear. Any help is greatly appreciated.

For example -

$string1 = '<P><STRONG><SPAN style="COLOR: rgb(255,0,0)">ILOM 2.6.1.6.a <BR>BIOS vers. 0CDAN860 <BR>LSI MPT SAS firmware MPT BIOS 1.6.00</SPAN></STRONG></P></DIV></TD>'

$string2 = '<P><A id=T5220 name=T5220></A><A href="http://mywebserver/index.html">Enterprise T5120 Server</A> <BR><A href="http://mywebserver/index.html">Enterprise T5220 Server</A></P></DIV></TD>'


$searchstring = "ILOM"
$regex = ".+>(.*$searchstring.+)<" # Tried this
$string1 -match $regex
# Expected result: $matches[x] contains "ILOM 2.6.1.6.a"

Similarly -

$searchstring = "BIOS"
$regex = ".+>(.*$searchstring.+)<" # Tried this
$string1 -match $regex
# Expected result: $matches[x] contains "BIOS vers. 0CDAN860"

$searchstring = "T5120"
$regex = ".+>(.*$searchstring.+)<" # Tried this
$string2 -match $regex
# Expected result: $matches[x] contains "Enterprise T5120 Server"

$searchstring = "T5220"
$regex = ".+>(.*$searchstring.+)<" # Tried this
$string2 -match $regex
# Expected result: $matches[x] contains "Enterprise T5220 Server"
gbabu

1 Answer


You need to add the lazy modifier ? to the quantifier (the "wildcard") after your search string so that it stops at the first occurrence of <.

.*< = Any characters, as many as possible (greedy), so the match runs up to the last <

.*?< = Any characters, as few as possible (lazy), so the match stops at the first <
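To make the difference concrete, here is a minimal sketch using a made-up sample string ($text is just an illustration, not your data). It runs the same capture group once with a greedy and once with a lazy quantifier:

```powershell
# Made-up sample text with two '<' characters after the opening '>'
$text = '>first<second<'

# Greedy: .* takes as much as possible, so the capture ends at the LAST '<'
if ($text -match '>(.*)<')  { $greedy = $matches[1] }

# Lazy: .*? takes as little as possible, so the capture ends at the FIRST '<'
if ($text -match '>(.*?)<') { $lazy = $matches[1] }

$greedy   # first<second
$lazy     # first
```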

I would also make the "wildcard" before your search string lazy, just to be safe, even though it isn't necessary in this particular situation.

The minimum required modification:

".+>(.*$searchstring.+?)<"

I would recommend:

".+>(.*?$searchstring.+?)<"

Sample:

$string1 = '<P><STRONG><SPAN style="COLOR: rgb(255,0,0)">ILOM 2.6.1.6.a <BR>BIOS vers. 0CDAN860 <BR>LSI MPT SAS firmware MPT BIOS 1.6.00</SPAN></STRONG></P></DIV></TD>'

$string2 = '<P><A id=T5220 name=T5220></A><A href="http://mywebserver/index.html">Enterprise T5120 Server</A> <BR><A href="http://mywebserver/index.html">Enterprise T5220 Server</A></P></DIV></TD>'


$searchstring = "ILOM"
$regex = ".+>(.*?$searchstring.+?)<"
if($string1 -match $regex) { $matches[1] }

# Custom regex: the search string must start right after the >
$searchstring = "BIOS"
$regex = ".+>($searchstring.+?)<"
if($string1 -match $regex) { $matches[1] }

# Or the original regex with a more specific search string
$searchstring = "BIOS vers"
$regex = ".+>(.*?$searchstring.+?)<"
if($string1 -match $regex) { $matches[1] }

$searchstring = "T5120"
$regex = ".+>(.*?$searchstring.+?)<"
if($string2 -match $regex) { $matches[1] }

$searchstring = "T5220"
$regex = ".+>(.*?$searchstring.+?)<"
if($string2 -match $regex) { $matches[1] }

Output:

ILOM 2.6.1.6.a 
BIOS vers. 0CDAN860 
BIOS vers. 0CDAN860 
Enterprise T5120 Server
Enterprise T5220 Server
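
The samples above depend on picking the search string or pattern carefully. As a rough sketch of another option, assuming the text you want never contains < or > itself, you could use a negated character class so the capture cannot cross a tag boundary, and collect every hit with [regex]::Matches. The function name Get-TagText is hypothetical, and [regex]::Escape is used only so that search strings containing characters such as . are treated literally. Note that [regex]::Matches is case-sensitive by default, unlike -match.

```powershell
$string1 = '<P><STRONG><SPAN style="COLOR: rgb(255,0,0)">ILOM 2.6.1.6.a <BR>BIOS vers. 0CDAN860 <BR>LSI MPT SAS firmware MPT BIOS 1.6.00</SPAN></STRONG></P></DIV></TD>'
$string2 = '<P><A id=T5220 name=T5220></A><A href="http://mywebserver/index.html">Enterprise T5120 Server</A> <BR><A href="http://mywebserver/index.html">Enterprise T5220 Server</A></P></DIV></TD>'

# Hypothetical helper: returns every piece of text between > and < that contains the search string
function Get-TagText {
    param([string]$Html, [string]$SearchString)

    # Treat the search string literally, even if it contains regex metacharacters
    $escaped = [regex]::Escape($SearchString)

    # [^<>]* matches anything except < and >, so a capture stays inside one text node
    $pattern = '>([^<>]*' + $escaped + '[^<>]*)<'

    # Emit the trimmed capture group of every match, in document order
    [regex]::Matches($Html, $pattern) | ForEach-Object { $_.Groups[1].Value.Trim() }
}

$bios  = @(Get-TagText $string1 'BIOS')
$t5220 = @(Get-TagText $string2 'T5220')

$bios[0]    # BIOS vers. 0CDAN860
$bios[1]    # LSI MPT SAS firmware MPT BIOS 1.6.00
$t5220[0]   # Enterprise T5220 Server
```

With this pattern the T5220 values inside the tag attributes (id=T5220 name=T5220) are not returned, because a capture would have to cross a < or > to reach them.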
Frode F.
  • Thanks @Frode F. However, it doesn't work with my second example. I wanted it to match "BIOS vers. 0CDAN860". – gbabu Mar 09 '15 at 21:54
  • That requires a different regex or search string. I have updated the answer with two BIOS examples. – Frode F. Mar 10 '15 at 08:57