Regular expression to transform brackets and nested brackets when inside a markup

Question

I want to write a regex that can remove the brackets surrounding [cent]

String input1 = "this is a [cent] and [cent] string" 
String output1 = "this is a cent and cent string"

But if it is nested like:

String input2="this is a [cent[cent] and [cent]cent] string"
String output2="this is a cent[cent and cent]cent string"

I can only use replaceAll on the string, so how do I create the pattern in the code below, and what should the replacement string be?

Pattern rulerPattern1 = Pattern.compile("", Pattern.MULTILINE);
System.out.println(rulerPattern1.matcher(input1).replaceAll(""));

Update: nested brackets are well-formed and can be only two levels deep, like in case 2.

Edit: If the string is "[<centd>[</centd>]purposes[<centd>]</centd>]", then the output should be <centd>[</centd> purposes <centd>]</centd>. Basically, if a bracket sits between the opening and closing centd tags, leave it there; otherwise remove it.

can you refine your question a little bit? if there are multiple layers of nested [], what is your expected output? — keelar, Jun 11 '13 at 18:39
I don't understand your question. You only want to remove the "shortest" set of brackets (around each word "cent") but not the larger, outer brackets? — Thorn G, Jun 11 '13 at 18:39
Are the brackets always well-formed (every opening has a closing)? How about the levels? A java regex can only support nesting to a fixed-level (the more levels, the bigger the regex). — acdcjunior, Jun 11 '13 at 18:41
yes they are well formed and can be only two levels deep like in case 2 — Phoenix, Jun 11 '13 at 18:42
In theory, there is no way you can do this with Regular Expressions, as only Context-Free and Context-Sensitive Grammars can remember what brackets have already been seen. — David Christo, Jun 11 '13 at 18:43
The nested case is not very clear. How will you deal with `[text[more]text[than]text[3]othertext]`? — nhahtdh, Jun 11 '13 at 18:45
@DavidChristo: Regular expression works here (theoretical even), since the number of levels is limited. — nhahtdh, Jun 11 '13 at 18:46
@DavidChristo Regexes can handle bracket nesting up to a fixed-level. Flavors that support the recursive operator `?R` (Java doesn't) can handle nesting up to any level. — acdcjunior, Jun 11 '13 at 18:46
@acdcjunior Understood, but then, Java does not adhere to the theory behind Regular Expressions :) — David Christo, Jun 11 '13 at 18:48
@DavidChristo It does, it just doesn't support some operators. But I agree with you on grammars being a much better option for tasks like this. The thing is that regexes can do bracket nesting - it is messy and somehow limited, but they can. — acdcjunior, Jun 11 '13 at 18:51
@DavidChristo: The **theoretical** regular expression **can** do matching for this case, since OP has specified that the number of nested levels is limited. — nhahtdh, Jun 11 '13 at 18:52
Does this need to be done with regex? As others have alluded to, you're trying to store state in a regex, and that isn't the idea behind regex. — dardo, Jun 11 '13 at 18:56
There shouldn't be a space after `and` in the output. It is not possible to do replacement if you don't have clear rule. And have you checked out my answer? It should work regardless. — nhahtdh, Jun 12 '13 at 04:26

Ro Yo Mi · Accepted Answer · 2013-06-17T20:13:12.073

6

Description

This regex would replace the brackets based on having space on only one side of the bracket.

regex: (?<=\s)[\[\]](?=\S)|(?<=\S)[\[\]](?=\s)

replace with empty string


Summary

Sample 1
- Input: this is a [cent] and [cent] string
- Output: this is a cent and cent string
Sample 2
- Input: this is a [cent[cent] and [cent]cent] string
- Output: this is a cent[cent and cent]cent string
Sample 3
- Input: [<cent>[</cent>] and [<cent>]Chemotherapy services.</cent>]
- Output: [<cent>[</cent> and <cent>]Chemotherapy services.</cent>]
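
A minimal Java sketch of the expression above, assuming it is applied with an empty replacement as described. The class and method names (`SpaceSideBrackets`, `strip`) are illustrative and not part of the original answer.

```java
import java.util.regex.Pattern;

class SpaceSideBrackets {
    // A bracket is removed only when exactly one of its neighbours is whitespace:
    // whitespace before and non-whitespace after, or the reverse.
    private static final Pattern ONE_SIDED = Pattern.compile(
            "(?<=\\s)[\\[\\]](?=\\S)|(?<=\\S)[\\[\\]](?=\\s)");

    static String strip(String input) {
        return ONE_SIDED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("this is a [cent] and [cent] string"));
        System.out.println(strip("this is a [cent[cent] and [cent]cent] string"));
        System.out.println(strip("[<cent>[</cent>] and [<cent>]Chemotherapy services.</cent>]"));
    }
}
```

A bracket at the very start or end of the string has no neighbour on one side, so neither alternative can match it. That is why Sample 3 keeps its first and last brackets.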

To address the edit on the question, this expression will:

- find [<centd>[</centd>] (or the same with ] inside the tags) and replace it with <centd>[</centd>
- find [<centd>] or [</centd>] and remove just the outer square brackets
- retain all other square brackets

regex: \[(<centd>[\[\]]<\/centd>)\]|\[(<\/?centd>)\]

replace with: $1$2


Sample 4
- Input: [<centd>[</centd>]pur [T] poses[<centd>]</centd>]
- Output: <centd>[</centd>pur [T] poses<centd>]</centd>
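
The second expression can be sketched in Java as follows. The names `CentdBrackets` and `unwrap` are hypothetical, and the test string with an unrelated `[T]` is an assumed input chosen to show that other brackets survive. The `\/` escape from the expression is written as a plain `/` here, since Java does not require it.

```java
import java.util.regex.Pattern;

class CentdBrackets {
    // Group 1: a <centd> element whose whole content is one bracket, wrapped in [ ].
    // Group 2: a lone <centd> or </centd> tag, wrapped in [ ].
    private static final Pattern WRAPPED = Pattern.compile(
            "\\[(<centd>[\\[\\]]</centd>)\\]|\\[(</?centd>)\\]");

    static String unwrap(String input) {
        // Only one group takes part in each match; the other contributes nothing,
        // so "$1$2" yields whichever group matched, without the outer brackets.
        return WRAPPED.matcher(input).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(unwrap("[<centd>[</centd>]pur [T] poses[<centd>]</centd>]"));
        System.out.println(unwrap("[<centd>]"));
        System.out.println(unwrap("[T]"));
    }
}
```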

edited Jun 17 '13 at 20:13

answered Jun 11 '13 at 19:16


How'd you make that picture? – Roddy of the Frozen Peas Jun 11 '13 at 19:19

@ Roddy of the Frozen Peas, I'm using debuggex.com. Although it doesn't support lookbehinds or atomic groups it's still handy for understanding the expression flow. There is also regexper.com. They do a pretty good job too, but it's not real time as you're typing. – Ro Yo Mi Jun 11 '13 at 19:20
Denomales, and what if the "cent" is actually a markup like [[]and []Chemotherapy services.] – Phoenix Jun 11 '13 at 23:38
That would be a different problem and would need to be addressed by changing the regular expression. Would you update your question to include that input and show what the desired output would be? – Ro Yo Mi Jun 12 '13 at 00:30
Providing there is a space before the word `and` in the input text `[[] and []Chemotherapy services.]` then this expression will output the desired `[[ and ]Chemotherapy services.]` – Ro Yo Mi Jun 12 '13 at 12:15
Denomales, this doesn't work for “[[]§ 431:10A–126 []Chemotherapy services.] Cancer treatment.test snl. – Phoenix Jun 13 '13 at 16:13
What is your desired output with `“[[]§ 431:10A–126 []Chemotherapy services.] Cancer treatment.test snl.`? – Ro Yo Mi Jun 13 '13 at 17:16
your sample inputs and outputs are not really consistent across all examples. Or it's not clear to me the rule set for removing/keeping brackets. – Ro Yo Mi Jun 13 '13 at 18:39
Sorry for the confusion. Basically, if the string is "[<centd>[</centd>]purposes[<centd>]</centd>]", then the output should be <centd>[</centd> purposes <centd>]</centd>. If a bracket sits between the opening and closing centd tags, leave it there; otherwise remove it. – Phoenix Jun 13 '13 at 20:54
Ok I updated the answer to leave brackets inside `]` and to remove them anywhere else. – Ro Yo Mi Jun 13 '13 at 22:12
Denomales, this doesn't work String input ="[[] and []Chemotherapy services.]"; should produce, String output ="[ and ]Chemotherapy services."; but it removes all the brackets. – Phoenix Jun 17 '13 at 17:47
Denomales, any other brackets in the input should not be removed for e.g. String input = "[T]" should stay like that. I want the brackets Only around centd to be removed. – Phoenix Jun 17 '13 at 18:26
Updated answer to cover this. – Ro Yo Mi Jun 18 '13 at 00:06

Mena · Answer 2 · 2013-06-11T19:15:30.660

If it's really only about finding brackets surrounding "cent", you could use the following approach (with lookbehind, lookahead):

Edited to leave some of the brackets as per the expected output: this is now a combination of positive and negative lookbehinds and lookaheads. In other words, regex is unlikely to be the right tool in general, but this pattern does work with the literals provided, and with some further variations.

import java.util.regex.Pattern;

// surrounding
String test1 = "this is a [cent] and [cent] string";
// pseudo-nested
String test2 = "this is a [cent[cent] and [cent]cent] string";
// nested
String test3 = "this is a [cent[cent]] and [cent]cent]] string";
// "[" runs followed by "cent" but not preceded by it, and "]" runs
// preceded by "cent" but not followed by it, are removed
Pattern pattern = Pattern.compile("((?<!cent)\\[+(?=cent))|((?<=cent)\\]+(?!cent))");
// replaceAll() resets the matcher itself, so no find() guard is needed
for (String test : new String[] { test1, test2, test3 }) {
    System.out.println(pattern.matcher(test).replaceAll(""));
}

Output:

this is a cent and cent string
this is a cent[cent and cent]cent string
this is a cent[cent and cent]cent string

If all you are doing is replaceAll, then you can just use the replaceAll method of String class. It also takes in a regex. — nhahtdh, Jun 11 '13 at 19:15
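
The suggestion in the comment above can be sketched like this, reusing the pattern from this answer. The wrapper names `CentBrackets` and `strip` are illustrative only.

```java
class CentBrackets {
    static String strip(String input) {
        // String.replaceAll takes the regex as its first argument,
        // so no explicit Pattern or Matcher object is needed.
        return input.replaceAll("((?<!cent)\\[+(?=cent))|((?<=cent)\\]+(?!cent))", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("this is a [cent] and [cent] string"));
        System.out.println(strip("this is a [cent[cent] and [cent]cent] string"));
        System.out.println(strip("this is a [cent[cent]] and [cent]cent]] string"));
    }
}
```

`String.replaceAll` compiles the pattern on every call, so a precompiled `Pattern` is still the better choice when the same replacement runs many times.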

score 0 · Answer 3 · edited May 23 '17 at 11:50


Regular expressions are unfit for this purpose in the general case. Nested structures form a recursive grammar, not a regular grammar. (That's why you don't parse HTML with regular expressions, BTW.)

If you only have a limited depth of bracket nesting, you can write a regular expression for that. But you need to state your nesting depth first, and the regexp will not be all that pretty.
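
To illustrate how such an expression grows with the depth, here is a small sketch that builds a two-level matcher from a one-level one. The class, constant, and method names are hypothetical, and the sketch assumes well-formed input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedDepthBrackets {
    // Depth 1: a bracket pair with no brackets inside.
    static final String DEPTH_1 = "\\[[^\\[\\]]*\\]";
    // Depth 2: a bracket pair whose content is plain text or depth-1 groups.
    // Each further level wraps the previous pattern in the same way,
    // so the expression gets longer with every level supported.
    static final String DEPTH_2 = "\\[(?:[^\\[\\]]|" + DEPTH_1 + ")*\\]";

    // Collects every bracketed group of at most two levels, in order of appearance.
    static List<String> groups(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(DEPTH_2).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(groups("a [b[c]d] e [f]"));
    }
}
```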


answered Jun 11 '13 at 19:03

9000


The OP did state the nesting depth in the comment, but it is not yet edited into the question... – nhahtdh Jun 11 '13 at 19:15
@nhahtdh: found that, edited the question to include the limitation. – 9000 Jun 11 '13 at 19:29

score 0 · Answer 4 · edited May 23 '17 at 10:25

Assumptions

From the question, the assumption is that there are no more than 2 levels of nesting brackets. It is also assumed that the brackets are balanced.

I further assume that you don't allow escaping of [].

I also assume that when there are nested brackets, only the first opening [ and the last closing ] brackets of the inner brackets are preserved. The rest, i.e. the top level brackets and the rest of the inner brackets are removed.

For example:

only[single] [level] outside[text more [text] some [text]moreeven[more]text[bracketed]] still outside

After replacement will become:

onlysingle level outsidetext more [text some textmoreevenmoretextbracketed] still outside

Aside from the assumptions above, there is no other assumption.

If you can make assumptions about the spacing before and after brackets, then you can use the simpler solution by Denomales. Otherwise, my solution below will work without such an assumption.

Solution

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static String replaceBracket(String input) {
    // Search for singly and doubly bracketed text
    Pattern p = Pattern.compile("\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");
    Matcher matcher = p.matcher(input);

    StringBuffer output = new StringBuffer(input.length());

    while (matcher.find()) {
        // Take the text inside the outer most bracket
        String innerText = matcher.group(1);
        int startIndex = innerText.indexOf("[");
        int endIndex;

        String replacement;

        if (startIndex != -1) {
            // 2 levels of nesting
            endIndex = innerText.lastIndexOf("]");

            // Remove all [] except for first [ and last ]
            replacement = 
                // Text before and including first [
                innerText.substring(0, startIndex + 1) + 
                // Text in between, stripped of all the brackets []
                innerText.substring(startIndex + 1, endIndex).replaceAll("[\\[\\]]", "") +
                // Text after and including last ]
                innerText.substring(endIndex);
        } else {
            // No nesting
            replacement = innerText;
        }

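        // quoteReplacement keeps any $ or \ in the text from being read as a group reference
        matcher.appendReplacement(output, Matcher.quoteReplacement(replacement));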
    }

    matcher.appendTail(output);

    return output.toString();
}

Explanation

The only thing worth explaining here is the regex. For the rest, check the documentation of the Matcher class.

"\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]"

In RAW form (when you print out the string):

\[((?:[^\[\]]++|\[[^\[\]]*+\])*+)\]

Let us break it up (spaces are insignificant):

\[                    # Outermost opening bracket
(                     # Capturing group 1
  (?:
    [^\[\]]++         # Text that doesn't contain []
    |                 # OR
    \[[^\[\]]*+\]     # A nested bracket containing text without []
  )*+
)                     # End of capturing group 1
\]                    # Outermost closing bracket

I used the possessive quantifiers *+ and ++ in order to prevent backtracking by the regex engine. The version with normal greedy quantifiers, \[((?:[^\[\]]+|\[[^\[\]]*\])*)\], would still work, but it will be slightly less efficient and can cause a StackOverflowError on big enough input.
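
As a quick check that the two variants agree on the question's input, the sketch below runs both and returns the captured inner text. The names `QuantifierComparison` and `firstInner` are illustrative, and the sketch makes no claim about performance.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierComparison {
    // Possessive version from the solution above.
    static final Pattern POSSESSIVE =
            Pattern.compile("\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");
    // Same structure with ordinary greedy quantifiers.
    static final Pattern GREEDY =
            Pattern.compile("\\[((?:[^\\[\\]]+|\\[[^\\[\\]]*\\])*)\\]");

    // Returns the text inside the first outermost bracket, or null if there is none.
    static String firstInner(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "this is a [cent[cent] and [cent]cent] string";
        System.out.println(firstInner(POSSESSIVE, input));
        System.out.println(firstInner(GREEDY, input));
    }
}
```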

grepit · Answer 5 · 2013-06-12T15:05:09.160

-1

You can use a Java Matcher to transform the brackets. Here is the pattern set up for your input:

         String input = "this is a [cent[cent] and [cent]cent] string";
         Pattern p = Pattern.compile("\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");
         Matcher m = p.matcher(input);

edited Jun 12 '13 at 15:05

answered Jun 11 '13 at 19:05


please add comments if you are marking it down, so i can better revise it. Thanks – grepit Jun 11 '13 at 20:20
It doesn't seem that this will produce the expected output. – nhahtdh Jun 12 '13 at 04:23
@nhahtdh I revised the pattern, would you please take a look? I think this will do what Phoenix is asking. – grepit Jun 12 '13 at 13:18
You just copied my pattern. And the pattern doesn't just work like that. – nhahtdh Jun 12 '13 at 14:15
@nhahtdh I corrected it based on your feedback. If you want to call that copying, that's fine. I was just trying to make sure the answer is correct. I am sorry I could not please you :( I did try though... maybe you can give me a +1 because I did correct the answer, even though you came up with the correct answer first. – grepit Jun 12 '13 at 15:07
Your answer is incomplete - it is not even able to achieve what OP wants. There is no reason for me to remove the downvote. You are free to delete the answer or edit it until it works. – nhahtdh Jun 12 '13 at 15:10
@nhahtdh I disagree with you respectfully but thanks for taking the time to provide your feedback. take care – grepit Jun 12 '13 at 15:15