remove html tag if it contains some text inside

Question

If the text inside a div matches some string, I want to remove the whole div. For example:

<div>
some text here
if this text is matched, remove whole div
some other text
</div>

I have to do this on many files, so I'm looking for a Linux command-line tool such as sed.

Thank you for looking into this.

Yeah don't use regular expressions for HTML, it'll go badly: http://stackoverflow.com/a/1732454/928098 — Kristian Glass, Apr 30 '12 at 01:21

anubhava · Answer 1 · 2011-04-26T18:44:41.637

If I understood your question correctly, this can be done with a single sed command:

sed '/<div>/I{:A;N;h;/<\/div>/I!{H;bA};/<\/div>/I{g;/\bsome text here\b/Id}}' file.txt

Testing

Let's say this is your file.txt:

a. no-div text

<DIV>

some text here
1. if this text is matched, remove whole DIV
some other text -- WILL MATCH
</div>

<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>

<div>
Some TEXT Here
4. if this text is matched, remove whole DIV
foo bar foo bar - WILL MATCH
</DIV>

c. no-div text

When I run the sed command above, it gives this output:

a. no-div text


<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>


c. no-div text

As you can verify from the output above, every div block in which the pattern "some text here" matched between the div tags has been removed completely.

PS: The search here is case-insensitive. If you don't need that behavior, please let me know; it only requires removing the I flags from the sed command above.
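For anyone wondering how the loop works, here is a sketch of the same idea spread over several lines with comments. It is a simplified variant rather than the exact one-liner above: it drops the hold-space commands (h, H, g), which the match does not strictly need, and it tests for the closing tag before reading more lines. It assumes GNU sed, since the I flag and \b are GNU extensions. The function name strip_matching_divs is made up for illustration.

```shell
# Hypothetical helper: reads stdin or the files given as arguments.
strip_matching_divs() {
  sed '
# Start collecting when a line contains an opening <div> (any case).
/<div>/I{
# Label A marks the top of the loop.
:A
# While the pattern space has no closing tag, append the next line and loop.
/<\/div>/I!{
N
bA
}
# The whole block is now in the pattern space: delete it if the text matches.
/\bsome text here\b/Id
}
' "$@"
}
```

Running strip_matching_divs file.txt prints the filtered text to standard output and leaves the file itself unchanged. For many files, one option is a loop such as: for f in *.html; do strip_matching_divs "$f" > "$f.new"; done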

Hi @anubhava, your code looks awesome , could you explain it a little bit? For example, the :A command — Steven You, Mar 12 '13 at 07:39

score 0 · Answer 2 · answered Apr 22 '11 at 17:02


There's probably a better way to do this, but what I've done in the past is:

1) strip out newlines (because matching across lines is difficult at best and going backwards even worse)

2) run the substitution on the resulting single line

3) put newlines back in

cat /tmp/data | tr "\n" "@" | sed -e 's/<div>[^<]*some text here[^<]*<\/div>//g' | tr "@" "\n"

This assumes that "@" does not appear in the file. It also assumes the div contains no nested tags, because each [^<]* stops at the first "<".
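If "@" might occur in the data, one variation is to use a control character as the temporary separator instead. The sketch below uses the byte \001, an arbitrary choice that is assumed not to occur in the files, and reads from standard input rather than through cat. The name join_strip_split is hypothetical. Like the command above, it relies on sed coping with one very long line that has no trailing newline, which GNU sed does.

```shell
# Hypothetical helper: filters standard input.
join_strip_split() {
  # Join all lines with \001, delete matching divs, then restore newlines.
  tr '\n' '\001' |
    sed -e 's/<div>[^<]*some text here[^<]*<\/div>//g' |
    tr '\001' '\n'
}

# Example: join_strip_split < /tmp/data
```

A literal "@" inside the text now passes through untouched, but the restriction on nested tags still applies.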

answered Apr 22 '11 at 17:02

drysdam



score 0 · Answer 3 · answered Apr 24 '11 at 15:12

You can use ed instead of sed. The ed command reads the entire file into memory and edits it in place (i.e. no backup copy is made).

htmlstr='
<see file.txt in answer by anubhava>
'
matchstr='[sS][oO][mM][eE]\ [tT][eE][xX][tT]\ [hH][eE][rR][eE]'
divstr='[dD][iI][vV]'
# for in-place file editing use "ed -s file" and replace ",p" with "w"
# cf. http://wiki.bash-hackers.org/howto/edit-ed
cat <<-EOF | sed -e 's/^ *//' -e 's/ *$//' -e '/^ *#/d' | ed -s <(echo "$htmlstr")
  H
  # ?re?   The previous line containing the regular expression re.  (see man ed)
  # '[[:<:]]' and '[[:>:]]' match the null string at the beginning and end of a word respectively. (see man re_format)
  #,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?,/<\/${divstr}>/d
  ,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?+0,/<\/${divstr}>/+0d
  ,p
  q
EOF
