Bash: uppercase text inside html tag with sed

Question

echo -e '<h1>abcd</h1>\n<h2>efgh</h2>' | sed 's#<h1>(.*?)<\h1>#<h1>\U&</h1>#g'

The desired output is:

<h1>ABCD</h1>
<h2>efgh</h2>

Any ideas? Thanks.

`(` and `)` in POSIX BRE are literal chars inside the pattern. Use the `-E` option and replace `.*?` with `[^>]*`, as lazy quantifiers are not supported by POSIX BRE/ERE. — Wiktor Stribiżew, Sep 10 '19 at 16:47
Wrong tool for the job. Obligatory historical link: https://stackoverflow.com/a/1732454/14122 — Charles Duffy, Sep 10 '19 at 16:47
`.*?` isn't supported syntax in *either* BRE or POSIX ERE. Neither is `\U` -- in using it, you're depending on nonstandard extensions (provided by your OS's copy of `sed`, as that tool is not part of bash and varies by platform). — Charles Duffy, Sep 10 '19 at 16:48
...so, if you want to do this *right*, use a real HTML parser -- Python's `lxml.html` is a great one. The cases where `sed` fails aren't just small and obscure -- put `<h1>` on one line, the text on a second, and `</h1>` on a third, and your code misses that there's a tag there. Or have `xmlns:h="http://www.w3.org/1999/xhtml"` in an enclosing scope and it needs to be `<h:h1>` that's recognized, but `sed` has no way to know that. Or make it `<h1 >`, which is equivalent to an HTML parser, and it again won't match... etc, etc, etc. — Charles Duffy, Sep 10 '19 at 16:51
Some down votes. A request to close. I just want to make it work with this very simple example. Nothing more... And till now, nothing. — Roger, Sep 10 '19 at 17:14
You have `</h1>` in your input string but you are trying to match `<\h1>` within `sed`. This is just one of the problems you currently have. — revo, Sep 10 '19 at 17:38

Léa Gris · Accepted Answer · 2019-09-12T09:33:42.340

This will work only for your case and is not parsing HTML.

DISCLAIMER

First read: https://stackoverflow.com/a/1732454/7939871

This sed search-and-replace regular expression is a shortcut, not real HTML parsing.

It is in no way fit for any kind of production setup, as it would break on many valid HTML syntax or layout variations, such as namespaces, multi-line tags, spacing, nesting, attributes, entities, CDATA…

sed -E 's#<h1>(.*)</h1>#<h1>\U\1\E</h1>#g' <<<$'<h1>abcd</h1>\n<h2>efgh</h2>'

Basically, it switches upper-casing on with \U, prints captured group 1 with \1, then switches upper-casing off with \E. Note that \U and \E are GNU sed extensions, so this will not work with every sed.
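
If it helps, here is a slightly tighter sketch of the same idea. It assumes GNU sed, and `upper_h1` is just a name made up for this illustration. Using `[^<]*` instead of `.*` keeps the match from running across several tags on one line, which the greedy `(.*)` would do. The last call shows one of the limitations listed in the disclaimer: a tag with an attribute is simply not matched.

```shell
#!/usr/bin/env bash
# Assumes GNU sed: \U and \E are GNU extensions, not POSIX.

# upper_h1 is a hypothetical helper name, introduced only for this sketch.
upper_h1() {
  # [^<]* stops at the next '<', so each <h1>...</h1> pair is matched on its own.
  sed -E 's#<h1>([^<]*)</h1>#<h1>\U\1\E</h1>#g'
}

# The sample from the question: only the h1 text is upper-cased.
printf '%s\n' '<h1>abcd</h1>' '<h2>efgh</h2>' | upper_h1

# Two headings on one line: the text and tags between them stay untouched.
printf '%s\n' '<h1>ab</h1> and <h1>cd</h1>' | upper_h1

# A tag with an attribute is not matched, so the line passes through unchanged.
printf '%s\n' '<h1 class="x">abcd</h1>' | upper_h1
```

With the original `(.*)`, the second input would come out as `<h1>AB</H1> AND <H1>CD</h1>`, because the greedy group swallows everything up to the last `</h1>` on the line.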
