How to delete duplicate lines from a block using sed

Question

Suppose we have a block of lines like the example below:

<segment1>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    .
    .
</segment1>

<segment2>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    .
    .
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
</segment2>

<segment3>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    .
    .
</segment3>

Here, for example, segment 2 contains duplicate lines that need to be deleted (the order of the lines doesn't matter). How can sed be restricted so that it deletes duplicates from segment 2 only? In this example the target happens to be the second segment, but that will not be true for every case that comes up: the block could also be nested inside another one (a subset of a subset).

My idea is to use a label with a start and an end marker, combined with the command gsed -ni 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'

Standard advice: [Don't Parse XML/HTML with regular expressions](https://stackoverflow.com/q/1732348/9164010); rather use an XML parser, such as [DOM](https://en.wikipedia.org/wiki/Document_Object_Model), [SAX](https://en.wikipedia.org/wiki/Simple_API_for_XML), [StAX](https://en.wikipedia.org/wiki/StAX), or [XSLT](https://en.wikipedia.org/wiki/XSLT). — ErikMD, Oct 27 '21 at 18:58
... and an XSLT processor would be well suited to this task, whereas `sed` is not, even if we assume rigidly regular formatting of the XML input. — John Bollinger, Oct 27 '21 at 19:08
For example, deleting *specifically* from segment 2 only could be easy with sufficient guarantees on the formatting of the input, but making `sed` figure out for itself that it needs to delete from segment 2, or which particular lines, would be very difficult if it is even possible at all. — John Bollinger, Oct 27 '21 at 19:12
what if we bound the region using a start and end keyword, rather than following the formatted input? — Swarnim Lakra, Oct 27 '21 at 19:27

score 0 · Answer 1 · answered Oct 27 '21 at 19:30

This might work for you (GNU sed):

sed -E '/<segment2>/,/<\/segment2>/{G;/^([^\n]*)(\n.*)*\n\1(\n|$)/!{P;h};d}' file

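Restrict the commands to the range of lines between <segment2> and </segment2>.

Within that range, append the hold space (a copy of the lines already seen in the range) to the current line. If the current line does not appear among them, print it and save the combined pattern space back to the hold space.

In either case, delete the pattern space afterwards, so a line that was already seen is dropped without being printed.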
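To see the steps above in action, here is a small sketch that runs the same command on a sample file. The temporary file, its contents, and the `sample` and `result` variable names are made up for illustration, and GNU sed is assumed.

```shell
# Hypothetical sample input: segment2 and segment3 both contain duplicates.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<segment1>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
</segment1>
<segment2>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
</segment2>
<segment3>
    <element="1" prop="blah"/>
    <element="1" prop="blah"/>
</segment3>
EOF

# Same GNU sed command as in the answer, applied to the sample file.
# G  appends the hold space (lines already seen in the range) to the current line.
# If the current line is not repeated there: P prints it, h remembers it.
# d  then discards the pattern space in both cases.
result=$(sed -E '/<segment2>/,/<\/segment2>/{G;/^([^\n]*)(\n.*)*\n\1(\n|$)/!{P;h};d}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With GNU sed, segment2 should come out with one copy of each element line, while the repeated line in segment3 is left alone because it lies outside the range.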

potong

Works on x64, but not on arm64 (BusyBox). – Swarnim Lakra Oct 28 '21 at 10:20
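Given the report above, one possible workaround is to do the same job with awk instead of sed. This is not part of the original answer, only a sketch. It assumes an awk is available on the target system (BusyBox usually ships one, but that is an assumption here) and uses a made-up sample file plus the hypothetical variable names `inside` and `seen`.

```shell
# Hypothetical sample input with duplicates inside segment2.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<segment1>
    <element="1" prop="blah"/>
</segment1>
<segment2>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
</segment2>
EOF

# awk sketch: remember each line seen between the start and end markers
# and skip any line that has already been seen inside that block.
result=$(awk '
  /<segment2>/         { inside = 1; split("", seen) }  # enter block, reset the seen table
  inside && seen[$0]++ { next }                         # repeated line inside the block: skip
  /<\/segment2>/       { inside = 0 }                   # leave block after the end marker
                       { print }                        # everything else passes through
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The start and end patterns play the same role as the sed range addresses, so they can be swapped for any other pair of keywords that bound the region.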
