RegEx to remove specific XML elements

Question

I'm using Kate to process text to create an XML file but I've hit a roadblock. The text now contains additional data that I need to remove based on its content.

To be specific, I have an XML element called <officers> that contains zero or more <officer> elements, which in turn contain further elements such as <title>, <name>, etc. While I could probably exclude these at run time using XSL, the file also drives another process that I don't want to touch: it's a general-purpose data importer for Scribus, so I don't want to change its code.

What I want to do is remove an <officer> element if the <title> content isn't what I want. For example, I don't want the First VP, so I'd like to remove:

    <officer>
      <title>First VP</title>
      <incumbent>Joe Somebody</incumbent>
      <address>....</address>
      <address>....</address>
      ......
    </officer>

I don't know how many lines will be in any <officer> element, nor what position it will occupy within the <officers> element.

The easy part is getting to the start of the content I want removed. The hard part is getting to the </officer> end tag. All the solutions I've found so far just result in Kate deciding that the RegEx is invalid.

Any suggestions are appreciated.

Regular expressions are not the proper tool for non-trivial manipulations of XML (and XML-like) data; consider using a proper parser instead. – CertainPerformance Aug 30 '18 at 04:04
Have you tried playing somewhere like https://regexr.com/? :-D Set it to PCRE to test your Perl regex syntax and get some interactive feedback. – Mike M Aug 30 '18 at 04:16
I'm not using Perl. Perl's RegExs don't translate directly into Kate's. – Gary Dale Aug 30 '18 at 14:57

score 1 · Answer 1 · answered Aug 30 '18 at 07:27

Regex is the wrong tool for this job; never process XML without a proper parser, except possibly for a one-off job on a single document where you will throw the code away after running it and checking the results by hand. You might find a regex that works on one sample document, but you'll never get it to work properly on a well-designed set of 100 test documents.
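To make that fragility concrete, here is a minimal sketch in Python's `re` dialect. That choice is an assumption: Kate's regex engine may accept a different syntax, so the pattern is illustrative rather than a drop-in. The names `PATTERN`, `remove_first_vp`, `SAMPLE`, and `VARIANT` are hypothetical. The pattern removes the First VP block from the sample document, but silently leaves it in place once the start tag gains an attribute.

```python
import re

# Hypothetical pattern: drop an <officer> block whose <title> is "First VP".
# The tempered dot (?:(?!</?officer>).) stops the match from running past
# the boundaries of a single <officer> element.
PATTERN = re.compile(
    r"[ \t]*<officer>(?:(?!</?officer>).)*?<title>First VP</title>"
    r"(?:(?!</?officer>).)*?</officer>[ \t]*\n?",
    re.DOTALL,
)


def remove_first_vp(xml_text: str) -> str:
    """Return xml_text with every matching First VP block removed."""
    return PATTERN.sub("", xml_text)


SAMPLE = """<officers>
  <officer>
    <title>President</title>
    <incumbent>Jane Somebody</incumbent>
  </officer>
  <officer>
    <title>First VP</title>
    <incumbent>Joe Somebody</incumbent>
  </officer>
</officers>
"""

# Same data as far as an XML parser is concerned, but the start tag now
# carries an attribute, which the literal "<officer>" in PATTERN cannot match.
VARIANT = SAMPLE.replace("<officer>", '<officer role="elected">')

if __name__ == "__main__":
    print("First VP" in remove_first_vp(SAMPLE))   # block removed from SAMPLE
    print("First VP" in remove_first_vp(VARIANT))  # block survives in VARIANT
```

Both documents describe the same officers to an XML parser; only the regex sees them differently.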

And it's easily done using XSLT. It's a stylesheet with two template rules: a default "identity template" rule to copy elements unchanged, and a second rule to delete the elements you don't want. In fact in XSLT 3.0 it gets even simpler:

    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="officer[title='First VP']"/>
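For readers without an XSLT processor at hand, the same two-rule idea (copy everything, drop the matching officer elements) can be sketched with the parser in Python's standard library. This is not the XSLT above, only a rough equivalent that assumes the file is well-formed XML; `remove_officers_by_title` and `SAMPLE` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def remove_officers_by_title(xml_text: str, unwanted_title: str) -> str:
    """Return the document with every <officer> whose <title> child
    equals unwanted_title removed; everything else is kept."""
    root = ET.fromstring(xml_text)
    # Snapshot the element list first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for officer in parent.findall("officer"):  # direct children only
            titles = officer.findall("title")
            if any(t.text == unwanted_title for t in titles):
                parent.remove(officer)
    return ET.tostring(root, encoding="unicode")


SAMPLE = """<officers>
  <officer>
    <title>President</title>
    <incumbent>Jane Somebody</incumbent>
  </officer>
  <officer role="elected">
    <title>First VP</title>
    <incumbent>Joe Somebody</incumbent>
  </officer>
</officers>"""

if __name__ == "__main__":
    print(remove_officers_by_title(SAMPLE, "First VP"))
```

Because the parser works on elements rather than characters, attributes or line breaks inside the tags do not affect the match. Serialisation details such as whitespace, comments, and the XML declaration may not survive the round trip unchanged, so the output is worth checking before it is fed to another tool.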

answered Aug 30 '18 at 07:27

Michael Kay


Thanks for your suggestions, but I don't want to use XSL for this because that would require me to learn enough PHP or a similar server-side language to be able to save the updated file. My goal is to do as much of this as possible using simple RegExs, to eventually script the entire process of transforming the initial text file into the XML I want. – Gary Dale Aug 30 '18 at 14:53
Sure, if you need to drive in a nail and don't want to learn how to use a hammer, then use the sole of your shoe; but don't expect professional advice from a carpenter. – Michael Kay Aug 30 '18 at 15:31
I'm not trying to start an argument, but RegExs are a great tool for manipulating text. I can start with a raw text file and manipulate it into well-formed and well-formatted XML. Asking that I now disrupt my workflow to set up a server so I can use a different tool for the last, relatively simple, edit is like telling a cabinet maker they need more lathe chisels. I prefer learning more about a tool that I use frequently instead of learning about a new tool that I am unlikely to use again. Thanks for taking the time to offer your advice but I really am looking for a RegEx solution. – Gary Dale Aug 30 '18 at 16:33
Do what you fancy; it's no skin off my back if your code doesn't work. Before you do it though, enjoy https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Michael Kay Aug 30 '18 at 21:50
