
Currently I am working on a PowerShell script that looks for tagging issues in XML files created by an authoring tool named Help and Manual.

As sometimes happens, I ran into a small problem that I could not solve on my own.

Let's imagine we have a string:

<para styleclass="Table Row Heading Text"><text style="font-size:12pt;">iso.outgoingQueueNameas</text></para>

What I want to do is create a regular expression that will match <text style="font-size:12pt;">.*</text> only if the string has <para styleclass="Table Row Heading Text"> at the beginning and the closing </para> tag at the end. To make matters worse, apart from <text style="font-size:12pt;">.*</text>, there could be any text inside the <para> element, as shown below:

<para styleclass="Table Row Heading Text">some text<text style="font-size:12pt;">iso.outgoingQueueNameas</text>some text</para>

I know that I can do some preliminary checks to find out whether a string starts with <para styleclass="Table Row Heading Text"> and ends with </para>, and then use a relatively simple regular expression to get what I want, but I would really like to know whether it can be done solely with a regular expression.

Ansgar Wiechers
  • [Don't parse X(HT)ML with regex](http://stackoverflow.com/a/1732454/5459839) – trincot Jan 20 '18 at 11:17
  • [It's pretty trivial](https://regex101.com/r/iBfxG3/2), but regexes just aren't meant for this... – GalAbra Jan 20 '18 at 11:33
  • @GalAbra, Thank you for the answer, but the regex must match only <text style="font-size:12pt;">.*</text> – Sergey Selyuto Jan 20 '18 at 11:37
  • You didn't mention what language you're using the regex with, so I'm not sure it's possible to use lookbehind. Also, the regex above returns a single capturing group – GalAbra Jan 20 '18 at 11:40
  • @GalAbra, Yes, it is possible to use lookbehind. The regex is used in a PowerShell script. You are right, it returns a capturing group, but then I have another question: what if there are several instances of <text style="font-size:12pt;">.*</text> in a string? – Sergey Selyuto Jan 20 '18 at 11:44

3 Answers


Unfortunately, you're asking how to screw in a light bulb with a hammer. You might be able to get the job done with the hammer, but it's more likely the bulb will end up shattered. You should be asking what better tools there are for changing light bulbs.

/metaphor

You should probably be using XPathDocument and XPathExpression to test this XML fragment for the conditions you're looking for.

I've tossed the fragment you've shared along with some similar elements into a file xpathfragment.xml:

<?xml version="1.0"?><xml>
<para styleclass="NOT Table Row Heading Text">some text<text style="font-size:12pt;">iso.otherstuffthings</text>other text></para>
<para styleclass="Table Row Heading Text">some text<text style="font-size:12pt;">iso.outgoingQueueNameas</text>some text</para>
<para styleclass="Table Row Heading Text">some text<text style="font-size:18pt;">iso.outgoingQueueNameas</text>some text</para>
</xml>

The following PowerShell script does what I think you're trying to do:

Find the inner text of <text> elements that have a 'style' attribute equal to 'font-size:12pt;' and whose immediate parent is a <para> element with 'styleclass' equal to 'Table Row Heading Text'.

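# Path to the sample file; adjust to wherever you saved it.
$filename = "c:\users\Username\Documents\xpathfragment.xml"
# Casting the path string loads the file into an XPathDocument.
$xpDoc = [System.Xml.XPath.XPathDocument] $filename
$xpDocNavigator = $xpDoc.CreateNavigator()
# <text> elements with the 12pt style whose parent <para> has the wanted styleclass.
$xpPathExpression = "/xml/para[@styleclass='Table Row Heading Text']/text[@style='font-size:12pt;']"

# Evaluate returns an iterator over the matching nodes.
$xpDocNavigator.Evaluate($xpPathExpression)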

This returns a single result from the test XML:

Value            : iso.outgoingQueueNameas
NodeType         : Element
LocalName        : text
NamespaceURI     : 
Name             : text
Prefix           : 
BaseURI          : file:///c:/users/Username/Documents/xpathfragment.xml
IsEmptyElement   : False
NameTable        : System.Xml.NameTable
HasAttributes    : True
HasChildren      : True
UnderlyingObject : iso.outgoingQueueNameas
LineNumber       : 3
LinePosition     : 53
IsNode           : True
XmlType          : 
TypedValue       : iso.outgoingQueueNameas
ValueType        : System.String
ValueAsBoolean   : 
ValueAsDateTime  : 
ValueAsDouble    : 
ValueAsInt       : 
ValueAsLong      : 
XmlLang          : 
SchemaInfo       : 
CanEdit          : False
OuterXml         : <text style="font-size:12pt;">iso.outgoingQueueNameas</text>
InnerXml         : iso.outgoingQueueNameas

The Value property, iso.outgoingQueueNameas, is, I think, what you wanted to find.

You'll need to adapt the XPath query to the structure of the XML document you're using, but the above should be enough to get you started. There is a bit of a learning curve to the XPath syntax, but in the end you'll understand a tool that is much better suited to searching XML.
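
If only the text values are needed rather than the full navigator objects, the same XPath query can be run through Select-Xml. This is a minimal sketch, not part of the original script: it assumes the sample XML is embedded inline instead of read from a file, and the variable names ($xml, $xpath, $values) are made up for illustration.

```powershell
# Assumption: the sample XML is inlined here so the sketch is self-contained.
$xml = [xml]@'
<xml>
<para styleclass="NOT Table Row Heading Text">some text<text style="font-size:12pt;">iso.otherstuffthings</text>other text</para>
<para styleclass="Table Row Heading Text">some text<text style="font-size:12pt;">iso.outgoingQueueNameas</text>some text</para>
<para styleclass="Table Row Heading Text">some text<text style="font-size:18pt;">iso.outgoingQueueNameas</text>some text</para>
</xml>
'@

# Same XPath expression as in the script above.
$xpath = "/xml/para[@styleclass='Table Row Heading Text']/text[@style='font-size:12pt;']"

# Select-Xml emits one result per matching node; .Node is the matched element.
$values = Select-Xml -Xml $xml -XPath $xpath | ForEach-Object { $_.Node.InnerText }
$values
```

With the sample data, only the second <para> satisfies both conditions, so $values should hold a single string.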

veefu

Parsing XML with regular expressions is error-prone and will cause you problems in the future. Use an XML parser, or validate the document against a schema such as a DTD or XSD.

gargkshitiz
  • My question is not about parsing XML files. It is about a regular expression that can solve the problem I described. You can replace tags with any text suitable for you, if you don't like my XML example. – Sergey Selyuto Jan 20 '18 at 11:32
  • @SergeySelyuto Your question doesn't make sense with other text; or rather, it's underspecified. Also, which one do you care about: A solution to your particular problem (not necessarily using a regex), or a deeper understanding of regexes (not necessarily solving your particular problem)? – melpomene Jan 20 '18 at 11:47
  • @melpomene As I mentioned in my question, I am just wondering if the problem can be solved solely by using a regular expression. – Sergey Selyuto Jan 20 '18 at 11:52
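  • @SergeySelyuto Isn't that just `<para styleclass="Table Row Heading Text">.*<text style="font-size:12pt;">.*</text>.*</para>`? – melpomene Jan 20 '18 at 11:53
  • @melpomene, it must match only `<text style="font-size:12pt;">.*</text>`, not the whole string. – Sergey Selyuto Jan 20 '18 at 11:57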
  • @SergeySelyuto Then it depends on whether powershell supports variable-width look-behind. But why do you care about what the whole regex matches? You can capture the parts you're interested in separately. – melpomene Jan 20 '18 at 11:58
  • @melpomene, Yes, PowerShell supports variable-width look-behind. I know I can capture the parts I am interested in, but I am wondering if there is a regular expression that can match only the `<text style="font-size:12pt;">.*</text>` instances in a string, provided it starts with `<para styleclass="Table Row Heading Text">` and ends with `</para>`. – Sergey Selyuto Jan 20 '18 at 12:14

Try using the following regex, then extract the capturing group using this answer

(?<=^<para styleclass="Table Row Heading Text">)(?:[^<]*)(<.*)(?=<\/para>)

It's going to capture all of the text between the first < after <para styleclass="Table Row Heading Text"> and </para> (not including these "edges").

Example input:

<para styleclass="Table Row Heading Text">some text<text style="font-size:12pt;">iso.outgoingQueueNameas</text><text style="font-size:12pt;">iso.outgoingQueueNameas</text></para>

Example capture:

<text style="font-size:12pt;">iso.outgoingQueueNameas</text><text style="font-size:12pt;">iso.outgoingQueueNameas</text>
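
In PowerShell, one way to pull out that capturing group is the -match operator together with the automatic $Matches variable. The following is a minimal sketch using the example input above; the variable names ($line, $pattern, $captured) are made up for illustration.

```powershell
# Example input from above, as a single-quoted (literal) string.
$line = '<para styleclass="Table Row Heading Text">some text<text style="font-size:12pt;">iso.outgoingQueueNameas</text><text style="font-size:12pt;">iso.outgoingQueueNameas</text></para>'

# The regex from this answer, unchanged.
$pattern = '(?<=^<para styleclass="Table Row Heading Text">)(?:[^<]*)(<.*)(?=<\/para>)'

# -match fills $Matches on success; index 1 is the first capturing group.
$captured = if ($line -match $pattern) { $Matches[1] } else { $null }
$captured
```

Note that $Matches[0], the overall match, still includes the leading "some text"; only group 1 starts at the first < character.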

GalAbra
  • Thank you for the answer again. I've never encountered this approach before. So, as I understand from the answers, this problem cannot be solved by using a single regular expression. – Sergey Selyuto Jan 20 '18 at 12:01
  • @SergeySelyuto As far as I know you can only use lookbehind if you know the length of the "looked-over" string. Because the length of the free text is unknown, I included it in the matched string - but not in the captured string. – GalAbra Jan 20 '18 at 12:03
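
On the lookbehind question raised in the comments: many regex engines require fixed-length lookbehind, but the .NET engine that PowerShell uses accepts variable-length lookbehind. Under that assumption, a single pattern that matches only the <text style="font-size:12pt;"> elements might look like the sketch below. The pattern and the variable names are illustrative, not taken from any of the answers, and the sketch assumes each <para> sits on one line.

```powershell
# Lookbehind: the line starts with the wanted <para> tag, followed by anything.
# Lookahead: the line ends with </para>.
$pattern = '(?<=^<para styleclass="Table Row Heading Text">.*)<text style="font-size:12pt;">.*?</text>(?=.*</para>$)'

$good = '<para styleclass="Table Row Heading Text">some text<text style="font-size:12pt;">iso.outgoingQueueNameas</text><text style="font-size:12pt;">iso.outgoingQueueNameas</text></para>'
$bad  = '<para styleclass="Other">some text<text style="font-size:12pt;">iso.outgoingQueueNameas</text></para>'

# [regex]::Matches returns every match, so several <text> elements are handled.
$goodMatches = @([regex]::Matches($good, $pattern) | ForEach-Object { $_.Value })
$badMatches  = @([regex]::Matches($bad, $pattern) | ForEach-Object { $_.Value })

$goodMatches
"Matches in non-heading line: $($badMatches.Count)"
```

Each match is one complete <text> element, and a line whose <para> has a different styleclass yields no matches. This remains fragile against attribute reordering, extra whitespace, or elements spanning several lines, which is why the XPath approach is still the safer choice.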