shell html tag remove duplicates

Question

I need to remove unnecessary italics tags from HTML, regardless of letter case :)

Here is my input:

text <i>text</i> <i>text text</i> text
text <i>text</i><i> text text</i> text
text<i>text </i>text text<i>text text</i> text
text <i>text</i><i></i> text text<i>text text</i> text

Here is the result I expect:

text <i>text text text</i> text
text <i>text text text</i> text
text<i>text </i>text text<i>text text</i> text
text <i>text</i> text text<i>text text</i> text

What have you tried so far? Please read the [**About**](http://stackoverflow.com/tour) page soon and also visit the links describing [**How to Ask a Question**](http://stackoverflow.com/questions/how-to-ask) and [**How to create a Minimal, Complete, and Verifiable example**](http://stackoverflow.com/help/mcve). — David C. Rankin, Sep 10 '19 at 23:11

TamaMcGlinn · Accepted Answer · 2019-09-12T08:26:56.887

's:</i><i>::g'

That captures the basic idea, which is to get rid of an end tag immediately followed by a new start tag, but it fails in three cases:

(1) capitalized tags:

Replace 'i' with '[Ii]' to match capitalized HTML tags. If a tag is kept in the output and you want its capitalization to remain as it was, rather than replacing it with a lowercase i, put the match within a \(group\) and use \1 on the output side of the sed command.

(2) whitespace between the tags: to collapse any run of whitespace into at most a single space, we put an optional group around the first space and emit that group in the output:

's:</i>\(\ \)\?\s*<i>:\1:g'

The parentheses, the question mark and the space are escaped with a backslash; the forward slash does not need escaping because the delimiter is : rather than /. The g at the end of each replacement allows it to match multiple times on each line.

(3) whitespace inside the tags: this should be matched with \s, which matches both tabs and spaces. Oddly enough, whitespace is allowed before the final > but not elsewhere in the tag. However, if a tag spans multiple lines, you are out of luck with this line-by-line approach. Matching across multiple lines is possible in sed, but it turns this into a script that is much too long for a single line.

After modifying for all three cases, the script line becomes:

sed -i 's:</[Ii]\s*>\(\ \)\?\s*<[Ii]\s*>:\1:g' yourfile.html
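To see what this does without touching a file, the same expression can be fed the four sample lines from the question on stdin. This is only a quick check, assuming GNU sed (for \s and \?); the variable names are made up for illustration.

```shell
# The four sample lines from the question.
input='text <i>text</i> <i>text text</i> text
text <i>text</i><i> text text</i> text
text<i>text </i>text text<i>text text</i> text
text <i>text</i><i></i> text text<i>text text</i> text'

# Same expression as above, but reading stdin instead of editing a file in place.
result=$(printf '%s\n' "$input" | sed 's:</[Ii]\s*>\(\ \)\?\s*<[Ii]\s*>:\1:g')

printf '%s\n' "$result"
```

The third line is left alone because its closing and opening tags are separated by text, not just whitespace.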

A note about -i (in-place replacement): this is a GNU sed option, not part of standard sed. The sed on OS X also has -i, but needs an extra '' argument after it. If your sed does not support -i, redirect the output to a new file, then mv that file over the original:

sed 'the same command' yourfile.html > newfile.html
mv newfile.html yourfile.html
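A slightly more defensive variant of the same idea is sketched below: it writes to a file created by mktemp and only replaces the original if sed exits successfully. The sample file is created on the spot so the sketch is self-contained, and the variable names are hypothetical. Keep in mind that \s and \? are GNU extensions as well, so a sed without -i may also need the pattern adjusted.

```shell
# Throwaway sample file standing in for yourfile.html.
file=$(mktemp)
printf '%s\n' 'text <i>text</i> <i>text text</i> text' > "$file"

# Write the result to a temporary file; replace the original only if sed succeeded.
# Note: after mv, the file carries the temporary file's permissions.
tmp=$(mktemp)
sed 's:</[Ii]\s*>\(\ \)\?\s*<[Ii]\s*>:\1:g' "$file" > "$tmp" && mv "$tmp" "$file"

cat "$file"
```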

See the SO question: 'sed edit file in place' for more information on that.
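Coming back to the multi-line limitation from point (3): with GNU sed specifically, the -z option may be a shorter way around it than a multi-line script. It makes sed split its input on NUL bytes rather than newlines, so an ordinary HTML file is read as one record and \s* can match across line breaks. A minimal sketch, assuming GNU sed and a made-up sample in which the tags are split over several lines:

```shell
# Made-up sample: the closing tag and the next opening tag sit on different lines.
input='text <i>text</i
>
<i> text text</i> text'

# -z (GNU sed) splits input on NUL bytes instead of newlines, so this whole
# input is one record and \s* is free to match the embedded newlines.
result=$(printf '%s\n' "$input" | sed -z 's:</[Ii]\s*>\(\ \)\?\s*<[Ii]\s*>:\1:g')

printf '%s\n' "$result"
```

The trade-off is that the whole file is held in memory as a single record, and the line breaks between the merged tags are removed along with the tags.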

Your solution is not case insensitive as asked (/I) and I think it should be possible with a single replacement: `sed -r 's:</i>( )?[ ]*<i>:\1:gI'` Also, OP did not mention this, but HTML tags can contain whitespace (even newlines) and still be valid, which would make things more complicated, if that were a requirement. — Erik Lievaart, Sep 11 '19 at 23:56
@ErikLievaart Thanks for the insight; using an optional group solves the problem neatly. — TamaMcGlinn, Sep 12 '19 at 08:28
Amazing! Thanks a lot for your help, this is very useful for me :) — fireDevelop.com, Sep 14 '19 at 16:28
