First, I am not a programmer.

I have a huge XML file with terms described thus:

<term>
<termId>MANUAL000399</termId>
<termUpdate>Add</termUpdate>
<termName>care</termName>
<termType>Pt</termType>
<termStatus>Active</termStatus>
<termApproval>Approved</termApproval>
<termCreatedDate>20120618T14:38:20</termCreatedDate>
<termCreatedBy>admin</termCreatedBy>
<termModifiedDate>20120618T14:40:41</termModifiedDate>
<termModifiedBy>admin</termModifiedBy>
</term>

In the file, each term has a <termType> of either Pt or ND.

I would like the solution to apply to both. What I would like to do is go through the file, look at the length of the word in <termName> and, if it has fewer than 5 characters, add another property, a <termNote>, after the <termModifiedBy> property:

<term>
<termId>MANUAL000399</termId>
<termUpdate>Add</termUpdate>
<termName>care</termName>
<termType>Pt</termType>
<termStatus>Active</termStatus>
<termApproval>Approved</termApproval>
<termCreatedDate>20120618T14:38:20</termCreatedDate>
<termCreatedBy>admin</termCreatedBy>
<termModifiedDate>20120618T14:40:41</termModifiedDate>
<termModifiedBy>admin</termModifiedBy>
<termNote label="Short">Short</termNote>
</term>

Can anyone advise on the best approach for this? I found regexes on here, but the problem is applying them. Someone suggested /\b[a-zA-Z]{5,}\b/, but I don't know how to write a script that takes this and then inserts the termNote if it matches.

BenMorel
lobe
  • It is hard not to provide a link to here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Boris Stitnicky Sep 11 '12 at 10:50
  • What should I use instead of regex? As I said, I am not a programmer and have no idea about these things. Thanks – lobe Sep 11 '12 at 11:01
  • I'm sorry that I'm not going to answer your question. But I can give a few comments. Firstly, if, as a non-programmer, you came as far as needing to do what you show here, then you need to become a programmer. Choose either Python or Ruby and learn it. Secondly, your question is not clear. You need to improve your text composition, and I'm sure XML guys out there will answer. Thirdly, do not parse XML with regexen unless you have a specific, known set of documents that happen to be able to be parsed by regex. Regex is not a golden hammer. – Boris Stitnicky Sep 11 '12 at 11:09

1 Answer

This transformation can be done by a simple XSLT stylesheet. (XSLT is a language that non-programmers often take to more enthusiastically than programmers. A stylesheet is basically a set of transformation rules: when you see something that matches X, replace it with Y. Of course, once you have mastered XSLT, you can call yourself a programmer.)

First some boilerplate:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/> <!-- strips whitespace-only text nodes from the input -->
<xsl:output indent="yes"/>      <!-- adds whitespace to the output -->

Then a default template rule that copies things unchanged if there's no more specific rule:

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

Then a template rule that matches short terms:

<xsl:template match="term[string-length(termName) &lt; 5]">
  <term>
    <xsl:copy-of select="*"/>
    <termNote label="Short">Short</termNote>
  </term>
</xsl:template>

and then finish off with:

</xsl:stylesheet>

You should be able to run this with any XSLT processor; there are plenty available. If nothing else comes to mind, download KernowForSaxon (from SourceForge), which is a very simple GUI around my Saxon processor.
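
For readers without an XSLT processor to hand, the same rule can be sketched in Python with the standard-library xml.etree.ElementTree module. This is an alternative illustration, not the stylesheet above: the <terms> wrapper element, the second sample term (MANUAL000400, "caregiver") and the function name mark_short_terms are all assumptions made for the example, since the original post does not show the root of the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input. The <terms> root element and the second term
# are assumptions for illustration; only the first term comes from the post.
SAMPLE = """<terms>
<term>
<termId>MANUAL000399</termId>
<termName>care</termName>
<termType>Pt</termType>
<termModifiedBy>admin</termModifiedBy>
</term>
<term>
<termId>MANUAL000400</termId>
<termName>caregiver</termName>
<termType>ND</termType>
<termModifiedBy>admin</termModifiedBy>
</term>
</terms>"""


def mark_short_terms(root, limit=5):
    """Append a termNote to each term whose termName has fewer than `limit` characters."""
    # Take a snapshot of the matching elements before modifying the tree.
    for term in list(root.iter("term")):
        name = term.findtext("termName", default="")
        if len(name) < limit:
            # SubElement adds the note as the last child of <term>,
            # mirroring where the stylesheet places it.
            note = ET.SubElement(term, "termNote", {"label": "Short"})
            note.text = "Short"
    return root


root = mark_short_terms(ET.fromstring(SAMPLE))
result = ET.tostring(root, encoding="unicode")
print(result)
```

For a real file, ET.parse("input.xml") followed by tree.write("output.xml") would replace the in-memory sample; both file names are placeholders. The check ignores <termType>, so it applies to Pt and ND terms alike.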

Michael Kay
  • Wow, that is fantastic, that's worked exactly! I can't tell you how grateful I am, thank you so much. – lobe Sep 11 '12 at 11:39