This answer comes nearly 3 years late, but a Perl regular expression can indeed be used for this task.
Finding and deleting a term block whose relation element contains the same termName as the term itself is possible with UltraEdit for Windows v21.10.0.1032, and most likely also with other text editors supporting Perl regular expressions, by running a case-sensitive Perl regular expression Replace with this search string:
^[ \t]*<term>(?:(?!</term>)[\S\s])+<termName>([^\r\n]+?)</termName>(?:(?!</term>)[\S\s])+<relation>(?:(?!</term>)[\S\s])+<termName>\1</termName>(?:(?!</term>)[\S\s])+</term>[ \t\r]*\n
The replace string is an empty string.
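Outside of UltraEdit, the same replace can be tried with any engine that supports lookaheads and back-references. The following is a minimal sketch in Python, not part of the original answer. The sample XML and the name remove_self_referencing_terms are hypothetical, and re.MULTILINE is an assumption needed so that ^ matches at the start of every line in Python.

```python
import re

# Search string from the answer, unchanged. re.MULTILINE is required in Python
# so that ^ matches at the beginning of every line, not only at the start
# of the text.
SELF_REFERENCING_TERM = re.compile(
    r"^[ \t]*<term>(?:(?!</term>)[\S\s])+<termName>([^\r\n]+?)</termName>"
    r"(?:(?!</term>)[\S\s])+<relation>(?:(?!</term>)[\S\s])+"
    r"<termName>\1</termName>(?:(?!</term>)[\S\s])+</term>[ \t\r]*\n",
    re.MULTILINE,
)

# Hypothetical sample data: the first term lists itself in its relation,
# the second one does not.
SAMPLE = """<terms>
  <term>
    <termName>apple</termName>
    <relation>
      <termName>apple</termName>
    </relation>
  </term>
  <term>
    <termName>pear</termName>
    <relation>
      <termName>fruit</termName>
    </relation>
  </term>
</terms>
"""


def remove_self_referencing_terms(text: str) -> str:
    """Delete every term block whose relation repeats the term's own name."""
    # The replace string is empty, as in the answer.
    return SELF_REFERENCING_TERM.sub("", text)


cleaned = remove_self_referencing_terms(SAMPLE)
print(cleaned)
```

With this sample, only the apple block is removed and the pear block is kept.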
Explanation:
^
... start every match at the beginning of a line.
[ \t]*
... there can be 0 or more spaces or tabs at the beginning of the line.
<term>
... this string must be found next on the line.
Next follows the tricky expression, which is required to match any character up to the next string of interest while avoiding matching into the next term block if the remaining expression does not produce a match within the current term block.
(?:(?!</term>)[\S\s])+
... this expression matches any character because [\S\s] matches any non-whitespace character or any whitespace character. Because of the +, there must be at least 1 character before the next fixed string, but there can be more. Additionally, before each character is matched, the negative lookahead (?!</term>) checks that the string </term> does NOT follow at the current position. If </term> is next, the regexp engine must stop matching characters at the current position in the stream and continue with the next part of the search string. So this expression can match any character, but never beyond </term>, and therefore only characters between <term> and </term>. Because of ?:, nothing is captured/marked for back-referencing by this expression.
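The effect of the lookahead can be seen by comparing it with a plain greedy [\S\s]+. This is a small Python sketch with a made-up two-term input. The plain version runs across the first </term> into the next block, while the guarded version is forced to start and end inside one block.

```python
import re

# Hypothetical input: only the second term contains a <relation> element.
text = "<term>one</term>\n<term>two<relation>x</relation></term>"

# Plain "any character" expression: free to run past </term>.
greedy = re.search(r"<term>[\S\s]+<relation>", text)

# Same expression with the negative lookahead guarding every character.
tempered = re.search(r"<term>(?:(?!</term>)[\S\s])+<relation>", text)

print(repr(greedy.group(0)))    # starts in the first term block
print(repr(tempered.group(0)))  # stays inside the second term block
```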
<termName>
... this fixed string must be found next within the term block.
([^\r\n]+?)
... matches the characters of the term name and captures/marks this string for back-referencing. Instead of the negated character class [^\r\n], it would also be possible to use another class definition, or just . if the dot does not match newline characters. ([^<]+) would also be possible if an unencoded opening angle bracket cannot be part of the term name. According to the XML specification, the character < must be encoded as &lt; within an element's value, except within a CDATA block.
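The difference between the capture variants only shows up when the name contains an unencoded <, which is legal inside a CDATA block. A short Python sketch with two hypothetical sample lines illustrates this:

```python
import re

plain_name = "<termName>apple pie</termName>"
cdata_name = "<termName><![CDATA[a<b]]></termName>"  # unencoded < inside CDATA

not_newline = re.compile(r"<termName>([^\r\n]+?)</termName>")
not_bracket = re.compile(r"<termName>([^<]+)</termName>")

# Both variants capture an ordinary name.
print(not_newline.search(plain_name).group(1))
print(not_bracket.search(plain_name).group(1))

# Only the [^\r\n] variant copes with the CDATA name.
print(not_newline.search(cdata_name).group(1))
print(not_bracket.search(cdata_name))
```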
</termName>
... this fixed string must be found next within the term block.
(?:(?!</term>)[\S\s])+
... again any character within the term block up to the next fixed string.
<relation>
... this fixed string must be found next within the term block.
(?:(?!</term>)[\S\s])+
... again any character within the term block up to the next fixed string.
<termName>
... this fixed string must be found next within the term block.
\1
... this expression back-references the captured/marked term name, so the next string must be the same as the name of the term defined above.
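The back-reference can be checked in isolation with a shortened pattern. This Python sketch uses a simplified expression and two hypothetical one-line inputs. It is not the full search string.

```python
import re

# Simplified pattern: term name, then the same name again inside <relation>.
same_name = re.compile(
    r"<termName>([^\r\n]+?)</termName><relation><termName>\1</termName>"
)

repeated = "<termName>apple</termName><relation><termName>apple</termName>"
different = "<termName>apple</termName><relation><termName>apples</termName>"

print(same_name.search(repeated).group(1))  # the captured term name
print(same_name.search(different))          # no match: the names differ
```

The second input fails although "apples" begins with "apple", because </termName> must follow directly after the back-referenced text.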
</termName>
... this fixed string must be found next within the term block.
(?:(?!</term>)[\S\s])+
... again any character within the term block up to the next fixed string.
</term>
... this fixed string marking the end of a term block must be found next.
[ \t\r]*\n
... matches 0 or more spaces, tabs and carriage returns, followed by a line-feed. So this expression works for both DOS/Windows (CR+LF) and Unix (LF only) text files.
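The line-ending handling can be verified on its own with a quick Python sketch using hypothetical sample strings. Because the \n is mandatory, a closing </term> at the very end of a file without a final line-feed would not be matched by this part.

```python
import re

line_end = re.compile(r"</term>[ \t\r]*\n")

dos_line = "</term>\r\n"        # CR+LF
unix_line = "</term>\n"         # LF only
padded_line = "</term> \t\r\n"  # trailing spaces/tabs before the line end
no_newline = "</term>"          # last line of a file without a line-feed

for sample in (dos_line, unix_line, padded_line, no_newline):
    print(repr(sample), bool(line_end.fullmatch(sample)))
```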
Also possible with UltraEdit is:
(?s)^[ \t]*<term>(?:(?!</term>).)+<termName>([^<]+?)</termName>(?:(?!</term>).)+<relation>(?:(?!</term>).)+<termName>\1</termName>(?:(?!</term>).)+</term>[ \t\r]*\n
(?s)
... this expression at the beginning of the search string changes the behavior of . from matching any character except line terminators to matching really any character, so . now behaves like [\S\s].
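The effect of (?s) is easy to confirm. The sketch below uses Python's re module, which supports the same inline flag, and a hypothetical three-line term block.

```python
import re

text = "<term>\n  <termName>apple</termName>\n</term>"

default_dot = re.search(r"<term>.+</term>", text)     # . stops at line ends
dotall_dot = re.search(r"(?s)<term>.+</term>", text)  # . matches everything
char_class = re.search(r"<term>[\S\s]+</term>", text)

print(default_dot)  # None: the dot cannot cross the first newline
print(dotall_dot.group(0) == char_class.group(0))
```

When porting the full (?s) search string to Python, ^ would additionally need the multiline flag, for example by writing (?ms) at the start.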