Deleting duplicate values using find and replace in a text editor

Question

I messed something up. In my XML, each non-preferred term has a preferred term to use. Something I have done has created some non-preferred terms where the preferred term to use has exactly the same name as the non-preferred term itself.

<term>
<termId>127699289611384833453kNgWuDxZEK37Lo4QVWZ</termId>
<termUpdate>Add</termUpdate>
<termName>Adenosquamous Carcinoma</termName>
<termType>Nd</termType>
<termStatus>Active</termStatus>
<termApproval>Approved</termApproval>
<termCreatedDate>20110704T09:41:31</termCreatedDate>
<termCreatedBy>admin</termCreatedBy>
<termModifiedDate>20110704T09:45:17</termModifiedDate>
<termModifiedBy>admin</termModifiedBy>
<relation>
  <relationType>USE</relationType>
  <termId>1276992897N1537166632rbr7BISWAI93SarY118G</termId>
  <termName>Adenosquamous Carcinoma</termName>
</relation>
</term>

Is there a text editor with a find and replace function I can use to tell it that if the <termName> in <relation> = the <termName> of the actual term, to just delete the whole <term>? I looked at the related queries and they mentioned regular expressions, but I've spent ages trying to build them and they are beyond me. Thanks!

Read your post 3 times now and I don't get what you would like to achieve. Can you add an "after" listing. Which OS are you on? The "tell it that if the in =the" part confuses me... — Fredrik Pihl, Jul 06 '11 at 11:04
Sorry about this. A non-preferred term should suggest a preferred term with a different name. It does this in <relation>, which specifies the id and name of the preferred term to use. In the example above, the XML is telling the system to use the same name for the preferred term as the non-preferred term. So the find and replace would go through, find where the values of these two elements were the same and, where they were, delete the whole term. So in the example above, the whole term would be deleted. If the value of termName in relation was different, nothing would be changed. — Charlie, Jul 06 '11 at 11:27
I am on Windows, though I could use a Mac if required. So, in the above, the whole thing would be deleted because the termName in <relation> = the termName in <term>. If they were different, nothing would be changed. — Charlie, Jul 06 '11 at 11:29

score 0 · Answer 1 · answered May 25 '14 at 13:51

This answer comes nearly 3 years late, but a Perl regular expression can indeed be used for this task.

A term block whose relation contains the same termName as the term itself can be found and deleted with UltraEdit for Windows v21.10.0.1032, and most likely also with other text editors supporting Perl regular expressions, by running a case-sensitive Perl regular expression Replace with this search string:

^[ \t]*<term>(?:(?!</term>)[\S\s])+<termName>([^\r\n]+?)</termName>(?:(?!</term>)[\S\s])+<relation>(?:(?!</term>)[\S\s])+<termName>\1</termName>(?:(?!</term>)[\S\s])+</term>[ \t\r]*\n

The replace string is an empty string.
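
For readers without UltraEdit, the same replace can be sketched in Python. This is only an illustration, assuming Python's re module treats this search string the same way as the editor's Perl engine; the function name, the sample data and its shortened ids are made up for the example. re.MULTILINE is needed so that ^ matches at the beginning of every line.

```python
import re

# The search string from above, unchanged. re.MULTILINE lets ^ match at each line start.
PATTERN = re.compile(
    r"^[ \t]*<term>(?:(?!</term>)[\S\s])+<termName>([^\r\n]+?)</termName>"
    r"(?:(?!</term>)[\S\s])+<relation>(?:(?!</term>)[\S\s])+"
    r"<termName>\1</termName>(?:(?!</term>)[\S\s])+</term>[ \t\r]*\n",
    re.MULTILINE,
)


def remove_self_referencing_terms(xml_text: str) -> str:
    """Delete every term block whose relation repeats the term's own name."""
    return PATTERN.sub("", xml_text)  # empty replace string, as in the editor


# Hypothetical sample: the first block points at itself, the second does not.
SAMPLE = """<term>
<termId>A1</termId>
<termName>Adenosquamous Carcinoma</termName>
<relation>
  <relationType>USE</relationType>
  <termId>B1</termId>
  <termName>Adenosquamous Carcinoma</termName>
</relation>
</term>
<term>
<termId>A2</termId>
<termName>Heart Attack</termName>
<relation>
  <relationType>USE</relationType>
  <termId>B2</termId>
  <termName>Myocardial Infarction</termName>
</relation>
</term>
"""

cleaned = remove_self_referencing_terms(SAMPLE)
print(cleaned)
```

Only the second block should remain in the printed result. It is worth running this on a copy of the file first.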

Explanation:

^ ... start every search at beginning of a line.

[ \t]* ... there can be 0 or more spaces or tabs at beginning of the line.

<term> ... this string must be found next on the line.

Next comes the tricky expression. It must match any character up to the next string of interest, but it must not run on into the next term block when the rest of the expression fails to match within the current term block.

(?:(?!</term>)[\S\s])+ ... this expression matches any character, because [\S\s] matches any non-whitespace character or any whitespace character. Because of the +, there must be at least 1 character before the next fixed string, but there can be more. In addition, before matching each character the regexp engine looks ahead to check that </term> does NOT start at that position. If </term> does start there, the engine must stop matching characters at the current position and continue with the next part of the search string. So this expression can match any character, but never beyond </term>, and therefore only characters between <term> and </term>. Because of ?: nothing is captured/marked for back referencing by this expression.
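
The effect of the look-ahead can be seen in isolation with a small Python sketch. The text and the word TARGET are made up for the illustration; only the guarded expression itself comes from the search string.

```python
import re

# Two hypothetical term blocks; only the second one contains TARGET.
TEXT = "<term>first</term>\n<term>second TARGET</term>\n"

# Plain "any character" happily runs past </term> into the next block.
unbounded = re.search(r"<term>[\S\s]+TARGET", TEXT)

# The guarded version cannot cross </term>, so a match starting at the
# first <term> fails and the search moves on to the second block.
bounded = re.search(r"<term>(?:(?!</term>)[\S\s])+TARGET", TEXT)

print(repr(unbounded.group()))
print(repr(bounded.group()))
```

The unguarded match starts in the first block and ends in the second, which is exactly the cross-block match the look-ahead is there to prevent.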

<termName> ... this fixed string within a term block must be found next.

([^\r\n]+?) ... matches the characters of the term name and captures/marks this string for back referencing. Instead of the negative character class [^\r\n], another class definition could be used, or just . if the dot does not match newline characters. ([^<]+) is also possible, provided an unencoded opening angle bracket cannot be part of the term name. According to the XML specification, the character < must be encoded as &lt; within an element's value, except within a CDATA block.
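
As a quick check of these alternatives, here is a small Python sketch. The term names are invented for the example, and Python's re module stands in for the editor's regexp engine.

```python
import re

# Hypothetical term name containing a correctly encoded "<".
LINE = "<termName>Values &lt; 5</termName>"

CAPTURE_VARIANTS = [
    r"<termName>([^\r\n]+?)</termName>",  # class used in the search string
    r"<termName>(.+?)</termName>",        # dot, which skips newlines by default
    r"<termName>([^<]+)</termName>",      # anything up to the next "<"
]

# All three variants capture the same name on a single, well-formed line.
captured = [re.search(p, LINE).group(1) for p in CAPTURE_VARIANTS]

# An unencoded "<" inside the name (not valid XML) defeats the [^<]+ variant.
raw_bracket = re.search(CAPTURE_VARIANTS[2], "<termName>a < b</termName>")

print(captured)
print(raw_bracket)
```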

</termName> ... this fixed string within a term block must be found next.

(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.

<relation> ... this fixed string within a term block must be found next.

(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.

<termName> ... this fixed string within a term block must be found next.

\1 ... this expression back references the captured/marked term name and therefore the next string must be the same as the name of the term defined above.
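
A reduced Python sketch of the back reference on its own, with made-up names and a simplified pattern that is not part of the search string above: the second termName only matches when it repeats the captured text exactly, including case.

```python
import re

# Simplified pattern: two termName elements in a row, the second one
# required to equal the first through the back reference \1.
PAIR = re.compile(r"<termName>([^<]+)</termName>\s*<termName>\1</termName>")

same = "<termName>Alpha</termName>\n<termName>Alpha</termName>"
different = "<termName>Alpha</termName>\n<termName>Beta</termName>"
other_case = "<termName>Alpha</termName>\n<termName>alpha</termName>"

matches_same = PAIR.search(same) is not None
matches_different = PAIR.search(different) is not None
matches_other_case = PAIR.search(other_case) is not None

print(matches_same, matches_different, matches_other_case)
```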

</termName> ... this fixed string within a term block must be found next.

(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.

</term> ... this fixed string marking end of a term block must be found next.

[ \t\r]*\n ... matches 0 or more spaces, tabs and carriage returns and next a line-feed. So this expression works for a DOS/Windows (CR+LF) and a Unix (only LF) text file.
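
The line ending handling can be checked with a short Python sketch on made-up strings. It also shows one consequence of requiring \n: a </term> at the very end of a file with no line terminator after it does not match this part.

```python
import re

# Only the tail of the search string: end tag plus the rest of its line.
LINE_END = re.compile(r"</term>[ \t\r]*\n")

windows_line = "</term>\r\nnext"      # CR+LF
unix_line = "</term>  \nnext"         # trailing spaces, then LF
no_final_newline = "</term>"          # last line of a file without a terminator

after_windows = LINE_END.sub("", windows_line)
after_unix = LINE_END.sub("", unix_line)
final_match = LINE_END.search(no_final_newline)

print(repr(after_windows), repr(after_unix), final_match)
```

If the last term block of a file has to be removed too, making sure the file ends with a line terminator before running the replace avoids that case.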

Also possible with UltraEdit is:

(?s)^[ \t]*<term>(?:(?!</term>).)+<termName>([^<]+?)</termName>(?:(?!</term>).)+<relation>(?:(?!</term>).)+<termName>\1</termName>(?:(?!</term>).)+</term>[ \t\r]*\n

(?s) ... this expression at beginning of the search string changes the behavior of . from matching any character except line terminators to really any character and therefore . is now like [\S\s].
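
Python's re module understands the same inline (?s) flag, so the difference can be sketched there as well. The sample block is made up, and re.MULTILINE is again an assumption needed for ^ to match at each line start.

```python
import re

BLOCK = "<term>\n<termName>X</termName>\n</term>"

# Without (?s) the dot stops at the first newline, so nothing matches.
without_flag = re.search(r"<term>.+</term>", BLOCK)
# With (?s) the dot also matches newlines and the whole block is found.
with_flag = re.search(r"(?s)<term>.+</term>", BLOCK)

# The second search string from above, unchanged.
PATTERN_S = re.compile(
    r"(?s)^[ \t]*<term>(?:(?!</term>).)+<termName>([^<]+?)</termName>"
    r"(?:(?!</term>).)+<relation>(?:(?!</term>).)+"
    r"<termName>\1</termName>(?:(?!</term>).)+</term>[ \t\r]*\n",
    re.MULTILINE,
)

# Hypothetical self-referencing term block.
SELF_REFERENCING = (
    "<term>\n<termName>X</termName>\n<relation>\n"
    "<termName>X</termName>\n</relation>\n</term>\n"
)
remaining = PATTERN_S.sub("", SELF_REFERENCING)

print(without_flag, with_flag is not None, repr(remaining))
```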
