bash (sed or awk preferred) to remove everything between first and last instance

Question

I'm pretty familiar with sed but I don't know awk very well, and I'm not sure how to solve this problem. I've googled for a while but no luck so far. Here's the situation: I've got a big file with groups and sections, like so:

<A1>
  some nr of lines
</A1>
<A2>
  some nr
  of lines
</A2>
<B1>
  some
  nr of
  lines
</B1>
<B2>
  some nr of lines
</B2>
<B3>
  bla
</B3>
<C1>
  bla
</C1>
<C2>
  bla
</C2>

Now the problem is that the number of groups can change, the number of sections can change, and the number of lines in each section can change. For example, group A might go up to A25, group B might go up to B8, and so on. What I need to do is remove all sections of certain groups. In the example above I'd like to remove everything in <B*>, leaving me with the following:

<A1>
  some nr of lines
</A1>
<A2>
  some nr
  of lines
</A2>
<C1>
  bla
</C1>
<C2>
  bla
</C2>

Additionally, there would be several sections I would want to remove (although these can be in separate runs), for example if the file goes from A1 to R123, I'd want to remove B*, F*, M*, etc.

If something similar has already been asked and answered somewhere I apologize, I did try to find a solution before posting.

Thanks!

Rather than stick with tools poorly designed for your task, you might look into tools that actually are designed to work with XMLish data: http://stackoverflow.com/questions/91791/grep-and-sed-equivalent-for-xml-command-line-processing — Mark, Dec 10 '12 at 21:18

anubhava · Accepted Answer · 2012-12-10T21:16:51.143

6

Using sed:

sed '/<B1>/,/<\/B3>/d' infile

This finds the range of lines starting at <B1> and ending at </B3> and deletes it, so sed prints the rest of the file to stdout.

EDIT: This will also work for your case:

sed '/<B[0-9]*>/,/<\/B[0-9]*>/d'
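A minimal sketch of the same range delete with the patterns anchored to whole lines and at least one digit required, so that a tag mentioned inside free-form text does not start a range. The names `remove_groups` and `sample_input` are hypothetical helpers made up for this illustration, and the sketch assumes every group is a single letter and every tag sits alone on its line, as in the question's sample:

```shell
# Hypothetical helper: delete every section whose tag is one of the given
# group letters followed by one or more digits.
# Usage: remove_groups 'BFM' infile    (or pipe the data in on stdin)
remove_groups() {
    groups=$1
    shift
    # ^ and $ anchor the tags to whole lines; [0-9][0-9]* means "one or more digits"
    sed "/^<[$groups][0-9][0-9]*>\$/,/^<\/[$groups][0-9][0-9]*>\$/d" "$@"
}

# Small made-up sample; the A1 body mentions a B tag inline on purpose.
sample_input() {
    printf '%s\n' '<A1>' '  mentions <B7> inline' '</A1>' \
                  '<B1>' '  b' '</B1>' '<B2>' '  b' '</B2>' \
                  '<C1>' '  c' '</C1>'
}

sample_input | remove_groups 'B'
```

Because sections are not nested, the first closing tag after an opening tag belongs to the same section, so passing several letters (for example 'BFM') removes all of those groups in one run.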

edited Dec 10 '12 at 21:16

answered Dec 10 '12 at 20:59


Thanks for the quick reply, unfortunately I thought of that, but I don't know how many sections there are in each group. Sometimes it may just be B1 to B3, but other times it may go up into the dozens (or hundreds) – Martin Dec 10 '12 at 21:05
All you need is the start text and end text for the above sed command. However, if the to-be-deleted sections are scattered all over the input file at random places then I'm afraid you will need to repeat the above sed command that many times. – anubhava Dec 10 '12 at 21:07
The problem is that I won't manually be looking up the end text. The file is thousands of lines long and modified regularly, and I need to generate a fixed list regularly. The sections should always be contiguous, but they will not be the same length. On average, I'm guessing the file will be ~4000 lines long, and I'm looking to remove a bit under half of it on a regular basis. – Martin Dec 10 '12 at 21:14
Hmm, in that case you may need a regex as well to define the ranges (as long as it is in one cluster it will work fine). Please check my edited answer now. – anubhava Dec 10 '12 at 21:18
`sed '/<B[0-9]*>/,/<\/B[0-9]*>/d'` did it! Awesome thank you! – Martin Dec 10 '12 at 21:20
You need to anchor those expressions or it'll misfire if those patterns appear elsewhere in the text. It'll also fail if <B> or its mate appear without any digits, so you need to change your RE for that too. – Ed Morton Dec 10 '12 at 21:27
@EdMorton: Pls check the question again: `I'd want to remove B*, F*, M*, etc` and since these are XML tags all together in a cluster I wouldn't worry about misfiring chances. – anubhava Dec 10 '12 at 21:31
@anubhava by those statements the OP would want to find B whether it's followed by digits or not but by his examples he wants at least one digit. Either way your RE is incorrect - it should either be or . Also, since it's free-form text between tags the only thing you can perhaps count on from his examples is that if appears in the free form text then it's indented so I think avoiding the trivial tweak of anchoring your REs with ^ and $ to protect from that case would not make sense. – Ed Morton Dec 10 '12 at 21:38

score 1 · Answer 2 · answered Dec 10 '12 at 21:24

I think what you're looking for is something like this:

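awk -v rmv="AC" 'BEGIN{
   # turn the group letters into an alternation: "AC" becomes "|A|C"
   gsub(/./,"|&",rmv)
   # append the tail shared by both patterns: "|A|C)[0-9]+>$"
   sub(/$/,")[0-9]+>$",rmv)
   start = end = rmv
   # start matches opening tags: "^<(A|C)[0-9]+>$"
   sub(/^\|/,"^<(",start)
   # end matches closing tags: "^</(A|C)[0-9]+>$"
   sub(/^\|/,"^</(",end)
}
$0 ~ start { f=1 }   # entering a section to remove
!f                   # print only while outside such a section
$0 ~ end   { f=0 }   # leaving it; the closing tag itself is not printed
' file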

Just populate the "rmv" variable with the list of all the sections you want removed:

$ awk -v rmv="B" '...'
<A1>
  some nr of lines
</A1>
<A2>
  some nr
  of lines
</A2>
<C1>
  bla
</C1>
<C2>
  bla
</C2>
$ awk -v rmv="AC" '...'
<B1>
  some
  nr of
  lines
</B1>
<B2>
  some nr of lines
</B2>
<B3>
  bla
</B3>
$
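To see what the BEGIN block builds from "rmv", the same statements can be run on their own and the two resulting patterns printed. This is only an illustrative sketch; the wrapper name `build_patterns` is hypothetical and not part of the answer:

```shell
# Hypothetical wrapper: print the start and end patterns that the BEGIN
# block of the awk script derives from a string of group letters.
build_patterns() {
    awk -v rmv="$1" 'BEGIN{
        gsub(/./,"|&",rmv)           # "BFM" becomes "|B|F|M"
        sub(/$/,")[0-9]+>$",rmv)     # append the shared tail
        start = end = rmv
        sub(/^\|/,"^<(",start)       # pattern for opening tags
        sub(/^\|/,"^</(",end)        # pattern for closing tags
        print start
        print end
    }'
}

build_patterns "BFM"
# prints:
# ^<(B|F|M)[0-9]+>$
# ^</(B|F|M)[0-9]+>$
```

Both patterns are anchored with ^ and $ and require at least one digit, so only a tag that is alone on its line toggles the flag.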
