1

I have the following part of string:

{{Infobox musical artist
|honorific-prefix  = [[The Honourable]]
| name = Bob Marley
| image = Bob-Marley.jpg
| alt = Black and white image of Bob Marley on stage with a guitar
| caption = Bob Marley in concert, 1980.
| background = solo_singer
| birth_name = Robert Nesta Marley
| alias = Tuff Gong
| birth_date = {{birth date|df=yes|1945|2|6}}
| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]
| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}
| death_place = [[Miami]], [[Florida]]
| instrument = Vocals, guitar, percussion
| genre = [[Reggae]], [[ska]], [[rocksteady]]
| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]] 
| years_active = 1962–1981
| label = [[Beverley's]], [[Studio One (record label)|Studio One]],
| associated_acts = [[Bob Marley and the Wailers]]
| website = {{URL|bobmarley.com}}
}}

And I'd like to remove all of it. If I try the regex \{\{(.*?)\}\}, it matches only {{birth date|df=yes|1945|2|6}}, which makes sense. So I tried \{\{([^\}]*?)\}\}, which then matches from the start but ends on the same line, which also makes sense since it has encountered }}. I've also tried it without the ? (i.e. greedy), with the same results. My question is: how can I remove everything that's inside a {{ }}, no matter how many of the same characters are nested inside?

Edit: If you want my entire input, it's this: https://en.wikipedia.org/w/index.php?maxlag=5&title=Bob+Marley&action=raw

eric.itzhak
  • Are there many of these structures in the input? – Bohemian Mar 03 '14 at 11:03
  • @Bohemian Yes there could be. I'm querying MediaWiki pages, and I know that often they do that. Maybe not specifically in wikipedia pages but in wiktionary they do. – eric.itzhak Mar 03 '14 at 11:05

4 Answers

1

Here's a solution with a DOTALL Pattern and a greedy quantifier for an input that contains only one instance of the fragment you wish to remove (i.e. replace with an empty String):

String input = "Foo {{Infobox musical artist\n"
                + "|honorific-prefix  = [[The Honourable]]\n"
                + "| name = Bob Marley\n"
                + "| image = Bob-Marley.jpg\n"
                + "| alt = Black and white image of Bob Marley on stage with a guitar\n"
                + "| caption = Bob Marley in concert, 1980.\n"
                + "| background = solo_singer\n"
                + "| birth_name = Robert Nesta Marley\n"
                + "| alias = Tuff Gong\n"
                + "| birth_date = {{birth date|df=yes|1945|2|6}}\n"
                + "| birth_place = [[Nine Mile, Jamaica|Nine Mile]], [[Jamaica]]\n"
                + "| death_date = {{death date and age|df=yes|1981|5|11|1945|2|6}}\n"
                + "| death_place = [[Miami]], [[Florida]]\n"
                + "| instrument = Vocals, guitar, percussion\n"
                + "| genre = [[Reggae]], [[ska]], [[rocksteady]]\n"
                + "| occupation = [[Singer-songwriter]], [[musician]], [[guitarist]] \n"
                + "| years_active = 1962–1981\n"
                + "| label = [[Beverley's]], [[Studio One (record label)|Studio One]],\n"
                + "| associated_acts = [[Bob Marley and the Wailers]]\n"
                + "| website = {{URL|bobmarley.com}}\n" + "}} Bar";
//                                    |DOTALL flag
//                                    |  |first two curly brackets
//                                    |  |     |multi-line dot
//                                    |  |     | |last two curly brackets
//                                    |  |     | |        | replace with empty
System.out.println(input.replaceAll("(?s)\\{\\{.+\\}\\}", ""));

Output

Foo  Bar

Notes after comments

This case implies using regular expressions to manipulate markup language.

Regular expressions are not designed to parse hierarchical markup, so they are a poor fit here. This answer is only a stub for what would be an ugly workaround at best.

See this famous SO thread on parsing markup with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
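
As a rough illustration of the parser route, here is a minimal sketch that removes `{{...}}` blocks by counting nesting depth instead of using a regex. The class and method names (`TemplateStripper`, `stripTemplates`) are made up for this example, and it assumes that `{{` and `}}` are the only delimiters that matter. Real wiki markup has more cases (for instance triple braces `{{{...}}}`) that this does not handle, so treat it as a starting point rather than a wiki markup parser.

```java
public class TemplateStripper {

    /**
     * Removes every {{...}} block, including nested ones, by tracking
     * how many "{{" openers are currently unclosed.
     * Hypothetical helper for illustration only.
     */
    public static String stripTemplates(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of "{{" seen that have not been closed yet
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (depth > 0 && text.startsWith("}}", i)) {
                depth--;
                i += 2;
            } else {
                // Keep characters only when we are outside every block.
                if (depth == 0) {
                    out.append(text.charAt(i));
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "Foo {{Infobox\n| birth_date = {{birth date|1945|2|6}}\n}} Bar {{URL|x}} Baz";
        System.out.println(stripTemplates(input));
    }
}
```

Two behaviours to be aware of: an unclosed `{{` swallows the rest of the input, and a stray `}}` outside any block is kept as ordinary text.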

Mena
  • What if there are many of these structures in the text? Your regex would grab almost the entire input, which I doubt is OP's intention. – Bohemian Mar 03 '14 at 11:02
  • @Bohemian that's exactly what OP is asking. – Xabster Mar 03 '14 at 11:05
  • Yes, there could be more than one – eric.itzhak Mar 03 '14 at 11:05
  • Then I would not use regex. This might require a parser, as it comes closer to markup. – Mena Mar 03 '14 at 11:06
  • @Mena The problem is that the MediaWiki API is absolutely terrible, and this is the only option; the content is really unexpected. And by the way, this regex did grab almost everything when tried with the full string (link in question) – eric.itzhak Mar 03 '14 at 11:08
  • @eric.itzhak the problem is that regex is not made for hierarchical structures such as markup (famous thread [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). As much as you can try, recognizing markup with regex always works through workarounds, hardly satisfying and mostly unreadable solutions. Either implement your own parser or use an existing API, it'll still be better than regex (unless you want to find-replace your text _manually_ (!!). – Mena Mar 03 '14 at 11:29
  • @Mena I see. Thanks. Please edit your question with a summary of why this wouldn't work and i'll accept it. – eric.itzhak Mar 03 '14 at 11:31
  • @Xabster No. He's not asking for this. This regex basically just matches from the first `{{` to the last `}}` - indiscriminately consuming everything between (including other blocks and intervening text) – Bohemian Mar 03 '14 at 12:35
0

Use a greedy quantifier instead of the reluctant one you're using.

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Edit: spoonfeeding: "\{\{.*\}\}"
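
For what it's worth, differing test results with this pattern can come down to newlines: without the DOTALL flag, `.` does not match line terminators, so the greedy pattern works on a one-line string but leaves a multi-line input like the one in the question untouched. A small sketch showing both cases, and also how the greedy match swallows text between two separate blocks (the class and method names are made up for illustration):

```java
public class GreedyDotDemo {

    // Greedy pattern without DOTALL: '.' stops at line terminators.
    public static String withoutDotAll(String s) {
        return s.replaceAll("\\{\\{.*\\}\\}", "");
    }

    // Same pattern with the embedded (?s) flag: '.' also matches newlines.
    public static String withDotAll(String s) {
        return s.replaceAll("(?s)\\{\\{.*\\}\\}", "");
    }

    public static void main(String[] args) {
        String oneLine = "{{a {{b}} c}}";
        String multiLine = "a {{x\ny}} b";
        String twoBlocks = "{{a}} keep {{b}}";

        System.out.println("[" + withoutDotAll(oneLine) + "]");   // whole block removed
        System.out.println("[" + withoutDotAll(multiLine) + "]"); // unchanged: no match across the newline
        System.out.println("[" + withDotAll(multiLine) + "]");    // block removed
        System.out.println("[" + withDotAll(twoBlocks) + "]");    // " keep " is removed as well
    }
}
```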

Xabster
  • I've tested it on http://regex101.com/, and it doesn't match. To make sure, I've tested it in my program as well, with the same results... – eric.itzhak Mar 03 '14 at 10:58
  • I tested in java. Xabster 1 - 0 eric. – Xabster Mar 03 '14 at 11:00
  • Xabster, first of all I appreciate your help. I have also tested in Java, and it doesn't seem to work, which surprises me... Am I doing it wrong? `str = str.replaceAll("\\{\\{(.*)\\}\\}", "");` – eric.itzhak Mar 03 '14 at 11:02
  • System.out.println("{{Infobox musical artist {{|honorific-prefix}} = [[The Honourable]] | name = Bob Marley | image = Bob-Marley.jpg}}".replaceAll("\\{\\{(.*)\\}\\}", "")); gives you what? – Xabster Mar 03 '14 at 11:11
  • As I said, right? I'm not sure what issue you're having now. – Xabster Mar 03 '14 at 11:19
  • If i print the same line of code with the exact string I provided in the question, it returns all of it – eric.itzhak Mar 03 '14 at 11:21
0

Try this pattern, it should take care of everything:

"\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D"

specify: DOTALL

code:

String result = searchText.replaceAll("\\D\\{\\{I.+[\\P{M}\\p{M}*+].+\\}\\}\\D", "");

example: http://fiddle.re/5n4zg

l'L'l
  • What's \X? My IDE doesn't recognize it as valid, and in the docs it appears as the beginning of hex chars – eric.itzhak Mar 03 '14 at 11:20
  • [`\X` matches a Unicode extended grapheme cluster character](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) – l'L'l Mar 03 '14 at 11:39
  • Are you only wanting to capture the Infobox? ... er, nvm I guess you found your solution. :) – l'L'l Mar 03 '14 at 11:51
  • Actually I would like to use your solution in another part, but I can't compile your regex, I believe because of the `\X`. It throws an exception and warns me of an invalid regex. In the docs you provided it's listed under `not supported` – eric.itzhak Mar 03 '14 at 12:17
  • And yes, I only want to grab the infobox, regardless of what may come afterwards or before. – eric.itzhak Mar 03 '14 at 12:29
  • Again a warning for an illegal regex – eric.itzhak Mar 03 '14 at 12:57
  • Still doesn't match. Thanks for trying, but I now understand that I had no idea such a thing as wiki markup existed, and what I basically tried to do is regex it. I should look for a wiki markup parser instead. – eric.itzhak Mar 03 '14 at 13:37
  • @eric.itzhak, check out > http://fiddle.re/5n4zg - you need to specify `dot matches new line` – l'L'l Mar 03 '14 at 14:27
  • @eric.itzhak, How doesn't it work? It was working for me just fine. – l'L'l Mar 03 '14 at 16:53
0

This regex matches a single such block (only):

\{\{([^{}]*?\{\{.*?\}\})*.*?\}\}

See a live demo.

In java, to remove all such blocks:

str = str.replaceAll("(?s)\\{\\{([^{}]*?\\{\\{.*?\\}\\})*.*?\\}\\}", "");
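
To make the scope of this pattern concrete, here is a small sketch (class and method names are invented for the example). It removes separate blocks independently and copes with one level of nesting, such as the `{{birth date|...}}` entries inside the infobox. A block nested two levels deep is beyond what it can balance, and in that case some closing braces are left behind.

```java
public class OneLevelNestingDemo {

    // The pattern from this answer, with the (?s) flag so '.' spans newlines.
    private static final String BLOCK =
            "(?s)\\{\\{([^{}]*?\\{\\{.*?\\}\\})*.*?\\}\\}";

    public static String removeBlocks(String s) {
        return s.replaceAll(BLOCK, "");
    }

    public static void main(String[] args) {
        // Two separate blocks, the first containing two nested blocks.
        String oneLevel = "Foo {{A {{b}} c {{d}} e}} Bar {{X}} Baz";
        System.out.println(removeBlocks(oneLevel));

        // A block nested two levels deep: the match ends too early.
        String twoLevels = "X {{a {{b {{c}} d}} e}} Y";
        System.out.println(removeBlocks(twoLevels));
    }
}
```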
Bohemian
  • Actually it worked! You're a life saver! Even with `{{ }}` in the rest of the string... Without the `(?s)`, this worked: `str = str.replaceAll("\\{\\{([^{}]*?\\{\\{.*?\\}\\})*.*?\\}\\}", "");` – eric.itzhak Mar 03 '14 at 13:00