Using AWK to remove characters matching HTML tags (not regex)

Question

I want to remove every HTML tag with awk, using this regex: /[<.*.>]/, wherever it matches in any field. I've been trying to make it work with sub or substr, but I can't find the correct logic.

Input text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation<br/><div style="margin-top:6px">< b>veniam:< /b>< /div> <br/><div style="margin-top:6px">< b>Confort:< /b></div>Comenzi volan; Cruise-control; Servodirectie; <br/>

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitationveniam: Confort:Comenzi volan; Cruise-control; Servodirectie;

`sed 's/<[^>]*>//g' file` is what you're looking for. But I'm sure this question has been asked at least a thousand times here, if not more ;) – sjsam Aug 24 '16 at 13:08
@TàiNguyễn Please check the duplicates pointed to by the other comments. – sjsam Aug 24 '16 at 13:12
Why are you so keen to use awk? It's not the right tool for the job. Try `php -R 'echo strip_tags($argn)."\n";' < file.html` (this example can be found in `man php`). – Tom Fenech Aug 24 '16 at 13:14
This isn't valid HTML. It's not even valid XML (there's a space between `<` and `b>`). The only way to parse HTML is to sacrifice virgin kittens and use a dedicated HTML parser; regular expression engines like awk aren't equipped to properly lex/parse HTML. – Marcus Müller Aug 24 '16 at 13:16
@MarcusMüller Ignore the space between < and b>; feel free to treat it as valid HTML, and I'll edit it into correct HTML. I've read the other suggestions, but they aren't related to awk. @TomFenech: In my current project I'm using awk to process thousands of files, so PHP isn't a good fit in this situation. – Tài Nguyễn Aug 24 '16 at 13:25
@TàiNguyễn You **must not** try to parse valid HTML with AWK. AWK is not equipped for that. It's the wrong tool. – Marcus Müller Aug 24 '16 at 13:27
Feel free to assume that I just have a string containing some HTML tags and I want to clean it with AWK. The related post doesn't match my question. Thanks. – Tài Nguyễn Aug 24 '16 at 13:48
That's exactly what the other post is about: understanding that regular expressions can't determine the boundaries of an HTML tag. – Marcus Müller Aug 24 '16 at 13:50

score 4 · Accepted Answer · answered Aug 24 '16 at 13:25

If you're not really parsing HTML but instead just want to remove every <...> pair and everything between it in a text file, then with GNU awk (which supports a multi-character RS) that'd be:

$ awk -v RS='<[^>]+>' -v ORS= '1' file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitationveniam: Confort:Comenzi volan; Cruise-control; Servodirectie;
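
For an awk without multi-character RS support, a per-line gsub() is a common portable alternative. The following is only a minimal sketch, assuming every tag opens and closes on the same line; `strip_tags` is a hypothetical helper name used for illustration and is not part of the answer above.

```shell
# Hypothetical helper: delete anything shaped like <...> from each input line.
# Assumption: no tag spans a line break and no '>' occurs inside a tag.
strip_tags() {
  awk '{ gsub(/<[^>]*>/, "") } 1' "$@"
}

# Usage on a fragment similar to the question's input.
printf '%s\n' 'exercitation<br/><div style="margin-top:6px"><b>veniam:</b></div> done' | strip_tags
# prints: exercitationveniam: done
```

Unlike the RS approach, this keeps the normal line-by-line record handling, so it cannot remove a tag that is split across two lines.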

Ed Morton

Noooooo... comments in tags, multi-line tags, text properties containing special chars... Please don't tell people to parse HTML with AWK – Marcus Müller Aug 24 '16 at 13:27
Oh, it works, many thanks to @Ed Morton. I tried many regexps but couldn't get it to work. – Tài Nguyễn Aug 24 '16 at 13:28
@TàiNguyễn I'll repeat this: It works for your relatively simple example, but given arbitrary valid input, it will break. Ed correctly stresses that this is not parsing HTML! It can only work if you pre-sanitize your input, and that, without loss of generality, is equal to parsing the input as HTML first. – Marcus Müller Aug 24 '16 at 13:29
@EdMorton fair point; still OP's application is, I cite OP's comment: *At the current project, I working with Awk to solve a thoundsands file* <--- I personally (I know this can be different for different people) justify the assumption that all that input will be as simple as OP's example. – Marcus Müller Aug 24 '16 at 13:31
@EdMorton (the "fair point" wasn't referring to knee-jerking, but to the fact that you correctly pointed out this is not parsing HTML) – Marcus Müller Aug 24 '16 at 13:31
@EdMorton & Tài: `. -->>bold` as an example – Marcus Müller Aug 24 '16 at 13:34
@EdMorton could you explain exactly how /<[^>]+>/ matches an HTML tag? Sorry for the stupid question, but regular expressions are an interesting thing. Thanks. – Tài Nguyễn Aug 24 '16 at 13:40
not "as desired", but "as specified". Pretty sure @TàiNguyễn would want to have `bold`, not `. -->>bold`. – Marcus Müller Aug 24 '16 at 13:40
@EdMorton the terrible thing is that I agree with you! The other terrible thing is that I assume that Tài doesn't realize he must guarantee that there's no nested `>` in his Tags – which he can't unless the thousands of files were generated by a controllable instance. – Marcus Müller Aug 24 '16 at 13:52
@EdMorton well, to be honest, this seems to be the core of what we can't agree on; just matching characters between `<>` seems a bit *too* fragile for what I'd expect in "normal" input (because that's much weaker than the HTML spec demands), while you, justifiably, argue that we can rely on the input being benign enough. In the end, it's up to Tài to decide which describes the input best – and obviously, something like "can I live with some misparsings that either solution will incur" is a factor, too. – Marcus Müller Aug 24 '16 at 15:47
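
The fragility discussed in these comments can be seen with a small, made-up input. This sketch assumes a per-line gsub() variant of the same `<[^>]...>` pattern; `naive_strip` and the sample line are hypothetical and only illustrate a `>` inside an attribute value.

```shell
# Hypothetical helper using the same "anything between < and >" idea.
naive_strip() {
  awk '{ gsub(/<[^>]*>/, "") } 1' "$@"
}

# The '>' inside the title attribute ends the match too early,
# so part of the tag leaks into the output.
printf '%s\n' '<a title="x > y">link</a>' | naive_strip
# prints:  y">link   (with a leading space, instead of just "link")
```

Whether this matters depends on the input: if the files are known never to contain `>` inside a tag, the pattern does what was specified.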
