
How can I parse a string and extract the text between bracketed tags (like <ID>12321</ID>)? I am working with the following string format:

<ID>298343</ID><TITLE>This is free text-that may contain / any character . . . <TITLE><ID>192723</ID><TITLE>Loreum Ipsum<TITLE><ID>298343</ID><TITLE>Thanks for help<TITLE><ID>192723</ID><TITLE>Strings are hard<TITLE>

Ideally, I'd like to read all of the ID digits as text into an array of strings. There may be up to a hundred ID values in these strings.


Yes, I am starting with XML. I am working with a format that gives me that markup:

[xml]$XMLResponse = $webclient.DownloadString("http://url.com/file.xml")
Write-Output $XMLResponse."string".ChildNodes

The output is:

Name            : #text
LocalName       : #text
NodeType        : Text
ParentNode      : string
Value           : <ID>298343</ID><TITLE>This is free text-that may contain / any character . . . <TITLE><ID>192723</ID><TITLE>Loreum Ipsum<TITLE><ID>298343</ID><TITLE>Thanks for help<TITLE><ID>192723</ID><TITLE>Strings are hard<TITLE>
InnerText       : <ID>298343</ID><TITLE>This is free text-that may contain / any character . . . <TITLE><ID>192723</ID><TITLE>Loreum Ipsum<TITLE><ID>298343</ID><TITLE>Thanks for help<TITLE><ID>192723</ID><TITLE>Strings are hard<TITLE>
Data            : <ID>298343</ID><TITLE>This is free text-that may contain / any character . . . <TITLE><ID>192723</ID><TITLE>Loreum Ipsum<TITLE><ID>298343</ID><TITLE>Thanks for help<TITLE><ID>192723</ID><TITLE>Strings are hard<TITLE>
Length          : 2464
PreviousSibling : 
NextSibling     : 
ChildNodes      : {}
Attributes      : 
OwnerDocument   : #document
FirstChild      : 
LastChild       : 
HasChildNodes   : False
NamespaceURI    : 
Prefix          : 
IsReadOnly      : False
OuterXml        : &lt;ID&gt;298343&lt;/ID&gt;&lt;TITLE&gt;This is free text-that may contain / any character . . . &lt;TITLE&gt;&lt;ID&gt;192723&lt;/ID&gt;&lt;TITLE&gt;Loreum Ipsum&lt;TITLE&gt;&lt;ID&gt;298343&lt;/ID&gt;&lt;TITLE&gt;Thanks for help&lt;TITLE&gt;&lt;ID&gt;192723&lt;/ID&gt;&lt;TITLE&gt;Strings are hard&lt;TITLE&gt;
InnerXml        : 
SchemaInfo      : System.Xml.Schema.XmlSchemaInfo
BaseURI         : 
Tartrantism
  • Any chance your document is XML? That'd make the operation fairly simple. – Jason Morgan Apr 07 '14 at 22:35
  • It starts as XML, then I'm left with a string node. Am I handling it wrong? [xml]$XMLResponse = $webclient.DownloadString("http://www.url.com/file.xml") $XMLResponse."string".ChildNodes – Tartrantism Apr 09 '14 at 13:01

1 Answer


Are all of the ID values six digits long? You can use a regular expression for this.

$Text = '<ID>298343</ID><TITLE>This is free text-that may contain / any character . . . <TITLE><ID>192723</ID><TITLE>Loreum Ipsum<TITLE><ID>298343</ID><TITLE>Thanks for help<TITLE><ID>192723</ID><TITLE>Strings are hard<TITLE>';
$MatchList = ([Regex]'(?<=<ID>)(\d{6})(?=</ID>)').Matches($Text);
$MatchList.Value;

Result:

298343
192723
298343
192723

NOTE: The duplicate values appear because your source text repeats those IDs.
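
If the IDs are not always six digits long, something along these lines might work instead. This is only a sketch: it swaps \d{6} for \d+ so that an ID of any length matches, casts the results to a string array as the question asks, and optionally drops the repeats. The variable names $Pattern, $Ids, and $UniqueIds are made up for illustration.

```powershell
# Sample input copied from the question (note the <TITLE> tags are never closed).
$Text = '<ID>298343</ID><TITLE>This is free text-that may contain / any character . . . <TITLE><ID>192723</ID><TITLE>Loreum Ipsum<TITLE><ID>298343</ID><TITLE>Thanks for help<TITLE><ID>192723</ID><TITLE>Strings are hard<TITLE>'

# \d+ accepts an ID of any length; the lookarounds keep the tags out of the match.
$Pattern = '(?<=<ID>)\d+(?=</ID>)'

# Collect the value of every match into a typed string array.
[string[]]$Ids = [regex]::Matches($Text, $Pattern) | ForEach-Object { $_.Value }

# Optional: remove repeated IDs while keeping first-seen order.
[string[]]$UniqueIds = $Ids | Select-Object -Unique

$Ids
```

Because the <TITLE> tags in the sample are not properly closed, loading the inner string as XML would likely fail, which is one reason a regular expression is a reasonable fit here.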

  • If the length is variable, you could always use a range quantifier and just keep the "<ID>", "</ID>" as your boundaries. – Jason Morgan Apr 07 '14 at 22:37
  • Seems like a valid answer in this case. I'm no SME, but Google quickly turned up (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), which cautions against the general use of RegEx as an [X|HT]ML parser. Worth being aware of, perhaps. – andyb Apr 07 '14 at 22:46