1

I have to modify a html-like text with the sed command. I have to delete substrings starting with one or more < chars, then having 0 or more occurrences of any characters but angle brackets and then any 1 or more > chars.

For example: from aaa<bbb>ccc I would like to get aaaccc

I am able to do this with

"s/<[^>]\+>//g"

but this command doesn't work if between <> characters is an empty string, or if there is double <<>> in the text. For example, from

aa<>bb<cc>vv<<gg>>h

I get

aa<>bbvv>h

instead of

aabbvvh

How can I modify it to give me the right result?

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Ann
  • 53
  • 6

2 Answers2

2

The problem is that once you allow nesting the < and > characters, you convert the language type from "regular" to "context free".

Regular languages are those that are matched by regular expressions, while context free grammars cannot be parsed in general by a regular expression. The unbounded level of nesting is what impedes this, needing a pile based automaton to be able to parse such languages.

But there's a little complicated workaround to this, if you consider that there's an upper limit to the level of nesting you will allow in the text you are facing, then you can convert into regular a language that is not, based on the premise that the non-regular cases will never occur:

Let's suppose you will never have more than three levels of nesting into your pattern, (this allows you to see the pattern and be able to extend it to N levels) you can use the following algorithm to build a regular expression that will allow you to match three levels of nesting, but no more (you can make a regexp to parse N levels, but no more, this is the umbounded bounded nature of regexps :) ).

Let's construct the expression recursively from the bottom up. With only one level of nesting, you have only < and > and you cannot find neither of these inside (if you allow < you allow more nesting levels, which is forbidden at level 0):

{l0} = [^<>]*

a string including no < and > characters.

Your matching text will be of this class of strings, surrounded by a pair of < and > chars:

{l1} = <[^<>]*>

Now, you can build a second level of nesting by alternating {l0}{l1}{l0}{l1}...{l0} (this is, {l0}({l1}{l0})* and surrounding the whole thing with < and >, to build {l2}

{l2} = <{l0}({l1}{l0})*> = <[^<>]*(<[^<>]*>[^<>]*)*>

Now, you can build a third, by alternating sequences of {l0} and {l2} in a pair of brackets... (remember that {l-i} represents a regexp that allows upto i levels of nesting or less)

{l3} = <{l0}({l2}{l0})*> = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>

and so on, successively, you form a sequence of

{lN} = <{l0}({l(N-1)}{l0})*>

and stop when you consider there will not be a deeper nesting in your input file.

So your level three regexp is:

<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l3--------------------------------------}
<{l0--}({l2---------------------}{l0--})*>
        <{l0--}({l1----}{l0--})*>
                <{l0--}>          

You can see that the regexp grows as you consider more levels. The good things is that you can consider a maximum level of three or four and most text will fit in this cathegory.

See demo.

NOTE

Never hesitate to build a regular expression, despite of it appearing somewhat complex. Think that you can build it inside your program, just using the techniques I've used to build it (e.g. for a 16 level nesting regexp, you'll get a large string, very difficult to write it by hand, but very easy to build with a computer)

package com.stackoverflow.q61630608;

import java.util.regex.Pattern;

public class NestingRegex {

    public static String build_regexp( char left, char right, int level ) {
        return level == 0
                ? "[^" + left + right + "]*"
                : level == 1
                        ? left + build_regexp( left, right, 0 ) + right
                        : left + build_regexp( left, right, 0 )
                        + "(" + build_regexp( left, right, level - 1 )
                        + build_regexp( left, right, 0 )
                        + ")*" + right;
    }

    public static void main( String[] args ) {
        for ( int i = 0; i < 5; i++ )
            System.out.println( "{l" + i + "} = "
                    + build_regexp( '<', '>', i ) );
        Pattern pat = Pattern.compile( build_regexp( '<', '>', 16 ), 0 );
        String s = "aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp";
        System.out.println(
                String.format( "pat.matcher(\"%s\").replaceAll(\"@\") => %s",
                               s, pat.matcher( s ).replaceAll( "@" ) ) );
    }


}

which, on run gives:

{l0} = [^<>]*
{l1} = <[^<>]*>
{l2} = <[^<>]*(<[^<>]*>[^<>]*)*>
{l3} = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l4} = <[^<>]*(<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>[^<>]*)*>
pat.matcher("aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp").replaceAll("@") => aa@bb@vv@h@ppp

The main advantage of using regular expressions is that once you have written it, it compiles into an internal representation that only has to visit each character of the string being matched once, leading to a very efficient final matching code (probably you'll not get so efficient writing the code yourself)

Sed

for sed, you only need to generate an enough deep regexp, and use it to parse your text file:

sed 's/<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>//g' file1.xml

will give you appropiate results (this is 6 levels of nesting or less ---remember the ( and ) must be escaped to be considered group delimiters in sed)

Your regexp can be constructed using shell variables with the following approach:

l0="[^<>]*"
l1="<${l0}>"
l2="<${l0}\(${l1}${l0}\)*>"
l3="<${l0}\(${l2}${l0}\)*>"
l4="<${l0}\(${l3}${l0}\)*>"
l5="<${l0}\(${l4}${l0}\)*>"
l6="<${l0}\(${l5}${l0}\)*>"
echo regexp is "${l6}"
regexp is <[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>
sed -e "s/${l6}/@/g" <<EOF
aa<>bb<cc>vv<<gg>>h<iii<jj<>j>k<k>k<<lll>mmm>ooo>ppp
EOF
aa@bb@vv@h@ppp

(I've used @ as substitution pattern, instead, so you can see where in the input string have the patterns been detected)

Community
  • 1
  • 1
Luis Colorado
  • 10,974
  • 1
  • 16
  • 31
1

You may use

sed 's/<\+[^>]*>\+//g'
sed 's/<\{1,\}[^>]*>\{1,\}//g'
sed -E 's/<+[^>]*>+//g'

The patterns match

  • <\+ / <\{1,\} - 1 or more occurrences of < char
  • [^>]* - negated bracket expression that matches 0 or more chars other than >
  • >\+ / >\{1,\} - 1 or more occurrences of > char

Note that in the last, POSIX ERE, example, + that is unescaped is a quantifier matching 1 or more occurrences, same as \+ in the POSIX BRE pattern.

See the online sed demo:

s='aa<>bb<cc>vv<<gg>>h'
sed 's/<\+[^>]*>\+//g' <<< "$s"
sed 's/<\{1,\}[^>]*>\{1,\}//g' <<< "$s"
sed -E 's/<+[^>]*>+//g' <<< "$s"

Result of each sed command is aabbvvh.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563