Remove html tag using regex with sed

Question

Say I have an HTML file converted from Word (DOCX) with the soffice --headless command. I then ran tidy so that the HTML looks clean, removing the unnecessary HTML/CSS cosmetics from Word.

The output contains something like:

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

<p lang="en-US" class="western c31"></p>

... repeated 15 times

I ran these commands:

sed -e 's/<(.*?)><\/(.?)>//g' > ./hasil.html

sed -e 's/<[a-z] lang="(.*) class="western (.*?)><\/[a-z]>//g' > ./hasil.html

Neither removes <p lang="en-US" class="western c31"></p> from the HTML file as expected.

I tried this link and this link, but neither helped.

Any help would be appreciated. Thank you.

Mandatory [don't use regex to parse html](http://stackoverflow.com/a/1732454/7552) link. – glenn jackman, Oct 19 '15 at 23:08
@glenjackman Thank you for reminding me. I almost forgot about that. Shame on me. – agungandika, Oct 19 '15 at 23:35
Isn't it sad? [11 Questions with the phrase "remove html tags" in title and sed in content](http://stackoverflow.com/search?q=sed+title%3Aremove+title%3Ahtml+title%3Atags+is%3Aquestion). Never mind... From the linked questions I guess you get some error message? If so, please add it to your question. – try-catch-finally, Oct 19 '15 at 23:39
@try-catch-finally I had never done that before you taught me the 'advanced search' on Stack Overflow. Thank you. Next time I will search other questions more carefully before posting the same question. – agungandika, Oct 20 '15 at 04:52

John1024 · Accepted Answer · 2015-10-20T07:17:49.643

sed's regular expressions always look for the leftmost longest match. Perl and some other tools support the form .*? for non-greedy matching, but sed doesn't.
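
Where a more general pattern than the literal tag is wanted, a negated bracket expression such as [^>]* can usually stand in for .*? because it cannot run past the end of the opening tag. The following is only a sketch under assumptions: the sample input and the variable names are made up for illustration, and the pattern assumes empty p elements whose class starts with "western". Note also that in sed's default (basic) syntax an unescaped ( or ) is a literal character, so the pattern below avoids groups altogether.

```shell
# Hypothetical sample: two empty Word paragraphs around one that has text.
input='<p lang="en-US" class="western c31"></p>
<p lang="en-US" class="western c31">Hello</p>
<p lang="de-DE" class="western c7"></p>'

# [^>]* matches any run of characters other than ">", so each piece of the
# pattern stays inside the opening tag instead of swallowing the whole line.
cleaned=$(printf '%s\n' "$input" | sed 's|<p [^>]*class="western [^>]*></p>||g')

# The paragraph containing "Hello" is kept; the two empty ones are removed.
printf '%s\n' "$cleaned"
```

The emptied lines remain as blank lines, which tidy or a later sed pass could drop if needed.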

If you want to delete those lines, try:

sed '\|<p lang="en-US" class="western c31"></p>|d' hasil.html

d is sed's delete command.

If you want to use a substitute command to remove only those tags, leaving behind whatever else, if anything, was on the line:

sed 's|<p lang="en-US" class="western c31"></p>||g' hasil.html
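
The difference between the two commands shows up when the empty paragraph shares a line with other markup. A small sketch, using made-up sample lines rather than the real hasil.html:

```shell
tag='<p lang="en-US" class="western c31"></p>'

# Three hypothetical lines: the tag alone, the tag after a heading, and an
# unrelated paragraph.
sample=$(printf '%s\n' "$tag" "<h1>Title</h1>$tag" '<p>Body</p>')

# d deletes every line that contains the tag, so the heading is lost too.
deleted=$(printf '%s\n' "$sample" | sed '\|<p lang="en-US" class="western c31"></p>|d')

# s removes only the tag text; the heading survives and the first line
# is left behind as an empty line.
stripped=$(printf '%s\n' "$sample" | sed 's|<p lang="en-US" class="western c31"></p>||g')

printf 'deleted:\n%s\n\nstripped:\n%s\n' "$deleted" "$stripped"
```

Both commands write to standard output and leave the input file untouched, so the result has to be redirected to a different file to be kept.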

edited Oct 20 '15 at 07:17

answered Oct 19 '15 at 23:00


To be precise, POSIX regex finds the left most longest match. There is no concept of "greedy" or "lazy" in POSIX regex, save for the similarity in simple cases. – nhahtdh Oct 20 '15 at 04:57
@nhahtdh Please clarify. In the context of regular expressions, what distinction are you making between (a) "greedy" and (b) "the left most longest match"? – John1024 Oct 20 '15 at 05:14
@John1024: The concept of "greedy" only exists for quantifiers in a backtracking engine, where the search order of a quantifier prefers repetition over moving on to the sequel. A pattern with a greedy quantifier may not find the longest match. On the other hand, "left most longest match" describes the contract of POSIX regex, where all possibilities of the regex must be exhausted and only the longest match is returned. – nhahtdh Oct 20 '15 at 06:00
@nhahtdh OK. Interesting. Can you provide an sed-compatible example of a regex where the two concepts yield different results? – John1024 Oct 20 '15 at 06:39
Compare `sed -e 's/\(a\{5,7\}\)*/X/g'` and [`(a{5,7})*`](https://regex101.com/r/jU7pT3/1) on the string `aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa`. – nhahtdh Oct 20 '15 at 07:02
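
The sed half of that comparison can be tried directly. This is a sketch under one assumption: the test string is rebuilt here as a run of 31 a's, a length picked for illustration because it is not a multiple of 7. Since 31 = 7+7+7+5+5, a leftmost-longest matcher can cover the whole string, so a POSIX-conforming sed should replace it with a single X. A greedy backtracking engine would instead take four groups of 7 and stop after 28 characters, because the remaining 3 cannot form another group.

```shell
# Build a run of 31 "a" characters (the length is an assumption for this demo).
s=$(printf '%031d' 0 | tr 0 a)

# BRE form from the comment: zero or more groups of 5 to 7 "a"s each.
result=$(printf '%s\n' "$s" | sed -e 's/\(a\{5,7\}\)*/X/g')

printf '%s\n' "$result"
```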
