echo -e '<h1>abcd</h1>\n<h2>efgh</h2>' | sed 's#<h1>(.*?)<\h1>#<h1>\U&</h1>#g'

The desired output is:

<h1>ABCD</h1>
<h2>efgh</h2>

Any ideas? Thanks.

Roger
  • `(` and `)` in POSIX BRE are literal chars inside the pattern. Use the `-E` option and replace `.*?` with `[^>]*`, as lazy quantifiers are not supported by POSIX BRE/ERE. – Wiktor Stribiżew Sep 10 '19 at 16:47
  • Wrong tool for the job. Obligatory historical link: https://stackoverflow.com/a/1732454/14122 – Charles Duffy Sep 10 '19 at 16:47
  • `.*?` isn't supported syntax in *either* BRE or POSIX ERE. Neither is `\U` -- in using it, you're depending on nonstandard extensions (provided by your OS's copy of `sed`, as that tool is not part of bash and varies by platform). – Charles Duffy Sep 10 '19 at 16:48
  • ...so, if you want to do this *right*, use a real HTML parser -- Python's `lxml.html` is a great one. The cases where `sed` fails aren't just small and obscure -- put `<h1>` on one line, the text on a second, and `</h1>` on a third, and your code misses that there's a tag there. Or have `h:xmlns="http://www.w3.org/1999/xhtml"` in an enclosing scope and it needs to be `<h:h1>` that's recognized, but `sed` has no way to know that. Or write the opening tag in another form that is equivalent to an HTML parser, and it again won't match... etc, etc, etc. – Charles Duffy Sep 10 '19 at 16:51
  • Some down votes. A request to close. I just want to make it work with this very simple example. Nothing more... And till now, nothing. – Roger Sep 10 '19 at 17:14
  • You have `</h1>` in your input string but you are trying to match `<\h1>` within `sed`. This is just one of the problems you currently have. – revo Sep 10 '19 at 17:38

1 Answer

This will work only for your case and is not parsing HTML.

DISCLAIMER

First read: https://stackoverflow.com/a/1732454/7939871

Handling HTML with a sed search-and-replace regular expression is a shortcut, not real parsing.

It is in no way suitable for any kind of production setup, as it would break on many valid HTML syntax or layout variations such as namespaces, multi-line elements, spacing, nesting, attributes, entities, CDATA…

sed -E 's#<h1>(.*)</h1>#<h1>\U\1\E</h1>#g' <<<$'<h1>abcd</h1>\n<h2>efgh</h2>'

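Basically, the replacement switches upper-casing on with \U, prints captured group 1 with \1, then switches upper-casing off with \E. Note that \U and \E are GNU sed extensions, so this will not work with every sed.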
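
A possible variation, offered only as a sketch under the same caveats: replacing the greedy `.*` with `[^<]*` keeps the match from running across several `<h1>` elements on the same line. It still assumes GNU sed (for `\U` and `\E`) and single-line elements without attributes. The function name `upcase_h1` is a hypothetical helper introduced here for illustration; it is not part of the original command.

```shell
# Upper-case the text of single-line <h1>...</h1> elements.
# Assumes GNU sed: \U and \E are GNU extensions, not POSIX.
# upcase_h1 is a hypothetical helper name used only for this sketch.
upcase_h1() {
  # [^<]* stops at the next '<', so each <h1> element on a line is
  # matched on its own instead of being swallowed by a greedy .*
  sed -E 's#<h1>([^<]*)</h1>#<h1>\U\1\E</h1>#g'
}

# The sample input from the question.
printf '%s\n' '<h1>abcd</h1>' '<h2>efgh</h2>' | upcase_h1
```

The limits listed in the disclaimer still apply: an `<h1>` whose opening tag, text, and closing tag sit on separate lines passes through unchanged, because sed processes its input one line at a time.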

Léa Gris