Suppose we have a block of lines like the example below:

<segment1>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    .
    .
</segment1>

<segment2>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    .
    .
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
</segment2>

<segment3>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    .
    .
</segment3>

Here, for example, segment 2 contains duplicate lines that need to be deleted (ordering doesn't matter). How can sed be restricted so that it deletes duplicates from segment 2 only? In this example the target happens to be the second segment, but that will not hold in every case, since the target could also be a subset of a subset.

My thought is to use a label with a start and an end marker, together with the command `gsed -ni 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'`

ErikMD
  • Standard advice: [Don't Parse XML/HTML with regular expressions](https://stackoverflow.com/q/1732348/9164010); rather use an XML parser, such as [DOM](https://en.wikipedia.org/wiki/Document_Object_Model), [SAX](https://en.wikipedia.org/wiki/Simple_API_for_XML), [StAX](https://en.wikipedia.org/wiki/StAX), or [XSLT](https://en.wikipedia.org/wiki/XSLT). – ErikMD Oct 27 '21 at 18:58
  • ... and an XSLT processor would be well suited to this task, whereas `sed` is not, even if we assume rigidly regular formatting of the XML input. – John Bollinger Oct 27 '21 at 19:08
  • For example, deleting *specifically* from segment 2 only could be easy with sufficient guarantees on the formatting of the input, but making `sed` figure out for itself that it needs to delete from segment 2, or which particular lines, would be very difficult if it is even possible at all. – John Bollinger Oct 27 '21 at 19:12
  • What if we bound the region using a start and an end keyword, rather than relying on the formatting of the input? – Swarnim Lakra Oct 27 '21 at 19:27

1 Answer

This might work for you (GNU sed):

sed -E '/<segment2>/,/<\/segment2>/{G;/^([^\n]*)(\n.*)*\n\1(\n|$)/!{P;h};d}' file

Use a range between <segment2> and </segment2>.

Within the range, append the hold space (a copy of the lines already seen) to the current line. If the current line does not occur among them, print it and copy the pattern space to the hold space.

Otherwise, delete the line.
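
To try this out, here is a minimal sketch that runs the same sed script on a small sample file. The file name `sample.txt` and the wrapper function `dedupe_segment` are hypothetical names introduced for illustration only. The wrapper passes the segment name as an argument instead of hard-coding `segment2`, and assumes the name contains no regex metacharacters. It requires GNU sed (the question calls it `gsed`).

```shell
# Hypothetical sample input, modelled on the layout in the question.
cat > sample.txt <<'EOF'
<segment1>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
</segment1>
<segment2>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
    <element="1" prop="blah"/>
    <element="2" prop="blah"/>
</segment2>
EOF

# dedupe_segment NAME FILE (hypothetical helper, not part of sed):
# apply the sed script above to the range <NAME> ... </NAME> of FILE.
# The script stays single-quoted; only the segment name is spliced in.
dedupe_segment() {
    sed -E '/<'"$1"'>/,/<\/'"$1"'>/{G;/^([^\n]*)(\n.*)*\n\1(\n|$)/!{P;h};d}' "$2"
}

# Print the file with repeated lines removed from segment2 only.
dedupe_segment segment2 sample.txt
```

With this input, segment1 passes through untouched and segment2 keeps only the first occurrence of each element line. Two things worth knowing before relying on it: the hold space is never cleared, so a second `<segment2>` block later in the same file would be compared against the first one, and blank lines inside the range are dropped as well, because the hold space always ends in a newline and an empty line therefore matches the "already seen" test.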

potong