
If the content of a div matches some string, I want to remove the whole div. For example:

<div>
some text here
if this text is matched, remove whole div
some other text
</div>

I have to do this on many files so I'm looking for some Linux commands like sed.

Thank you for looking into this.

jww
Amol

3 Answers


If I understood your question correctly, this can be done with a single sed command (GNU sed, since the I flag and \b are GNU extensions):

sed '/<div>/I{:A;N;h;/<\/div>/I!{H;bA};/<\/div>/I{g;/\bsome text here\b/Id}}' file.txt
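For readers who find the one-liner hard to follow, here is a sketch of what I read as the same logic, spread over one command per line. The function name remove_matching_divs is made up for illustration, and I have left out the h/H/g hold-space steps on the assumption that they do not change the result (the pattern space already holds the whole block once </div> is reached). It still needs GNU sed.

```shell
# Hypothetical wrapper; reads the named files, or stdin if none are given.
remove_matching_divs() {
  sed '
    # Start collecting when an opening <div> is seen (I = case-insensitive).
    /<div>/I {
      :A
      # Append the next input line to the pattern space.
      N
      # No closing tag yet: loop back and keep appending.
      /<\/div>/I!bA
      # Whole block collected: delete it if it contains the phrase.
      /\bsome text here\b/Id
    }
  ' "$@"
}
```

It would be called as remove_matching_divs file.txt. Blocks that do not contain the phrase fall through the closing brace and are printed unchanged.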

Testing

Let's say this is your file.txt:

a. no-div text

<DIV>

some text here
1. if this text is matched, remove whole DIV
some other text -- WILL MATCH
</div>

<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>

<div>
Some TEXT Here
4. if this text is matched, remove whole DIV
foo bar foo bar - WILL MATCH
</DIV>

c. no-div text

Running the sed command above on this file gives the following output:

a. no-div text


<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>


c. no-div text

As the output shows, every div block in which the pattern some text here matched between the div tags has been removed completely.

PS: The search here is case-insensitive. If you don't need that behavior, just remove the I flags from the sed command above.
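Since the question mentions many files, one way to run this over a whole directory tree is sketched below. The function name remove_divs_in_tree, the *.html file pattern and the .bak backup suffix are assumptions for illustration, not something from the question. The script is passed as separate -e pieces, with the hold-space steps left out on the assumption that they do not affect the result, so it does not depend on how a given sed version terminates labels. It relies on GNU sed for -i, I and \b.

```shell
# Hypothetical helper: edit every *.html file under directory $1 in place,
# keeping the original of each file as <name>.bak (GNU sed -i.bak).
remove_divs_in_tree() {
  find "$1" -type f -name '*.html' -exec \
    sed -i.bak \
      -e '/<div>/I{' \
      -e ':A' \
      -e 'N' \
      -e '/<\/div>/I!bA' \
      -e '/\bsome text here\b/Id' \
      -e '}' {} +
}
```

For example, remove_divs_in_tree ./site would rewrite the files under ./site. It is worth trying this on a copy of the data first and checking the .bak files before deleting them.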

anubhava

There's probably a better way to do this, but what I've done in the past is:

1) strip out newlines (because matching across lines is difficult at best and going backwards even worse)

2) parse

3) put newlines back in

tr "\n" "@" < /tmp/data | sed -e 's/<div>[^<]*some text here[^<]*<\/div>//g' | tr "@" "\n"

This assumes that "@" never appears in the file. Because of the [^<]* parts, it also only matches a div that contains no other tags.
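If GNU sed is available, the placeholder character can be avoided altogether. The sketch below swaps in a different technique from the one above: sed's -z option, which splits input on NUL bytes instead of newlines, so a file without NUL bytes is handled as one record and the regex can span lines directly. The function name strip_divs_slurp is made up for illustration, and the same "no other tags inside the div" limitation applies.

```shell
# Hypothetical wrapper around GNU sed -z (NUL-separated records).
# Assumes the input contains no NUL bytes, so the whole file is one record.
strip_divs_slurp() {
  sed -z 's/<div>[^<]*some text here[^<]*<\/div>//g' "$@"
}
```

It would be used as strip_divs_slurp /tmp/data. Like the tr version, this holds the whole file in memory at once, which is fine for typical HTML files.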

drysdam

You may use ed instead of sed. The ed command reads the entire file into memory and edits it in place, so no backup copy is made.

htmlstr='
<see file.txt in answer by anubhava>
'
matchstr='[sS][oO][mM][eE]\ [tT][eE][xX][tT]\ [hH][eE][rR][eE]'
divstr='[dD][iI][vV]'
# for in-place file editing use "ed -s file" and replace ",p" with "w"
# cf. http://wiki.bash-hackers.org/howto/edit-ed
cat <<-EOF | sed -e 's/^ *//' -e 's/ *$//' -e '/^ *#/d' | ed -s <(echo "$htmlstr")
  H
  # ?re?   The previous line containing the regular expression re.  (see man ed)
  # '[[:<:]]' and '[[:>:]]' match the null string at the beginning and end of a word respectively. (see man re_format)
  #,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?,/<\/${divstr}>/d
  ,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?+0,/<\/${divstr}>/+0d
  ,p
  q
EOF
jeff