Suppose I have a few lines of Wikipedia XML that look like this:

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

I want to remove the part that begins with "[[Image:" and is closed by "observances]]". Other lines of text may contain brackets as well, and I don't want a greedy search, because it might accidentally remove those other brackets too.

For example, if I just used a greedy \\[\\[Image:.*\\]\\], I believe it would remove everything up to the last closing brackets (the ones after Errico Malatesta).

Is there a regular expression that can make this easier for me?

chown
Dan Q

5 Answers

2

Let's see... what about using lazy repetition instead of greedy?

\[\[Image:.*?observances\]\]
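To see how the quantifier changes the result, here is a small sketch on a shortened version of the sample text. The class name QuantifierDemo, its helper methods, and the shortened input are made up for illustration; only the patterns come from this thread.

```java
class QuantifierDemo {
    // Shortened, hypothetical stand-in for the Wikipedia sample text.
    static final String SAMPLE =
        "[[Image:a.jpg|by [[Walter Crane]] observances]] In 1907, [[Errico Malatesta]]";

    // Greedy: .* runs to the LAST "]]" in the input.
    static String greedy(String s) {
        return s.replaceAll("\\[\\[Image:.*\\]\\]", "");
    }

    // Reluctant, anchored on the literal word "observances".
    static String reluctantAnchored(String s) {
        return s.replaceAll("\\[\\[Image:.*?observances\\]\\]", "");
    }

    // Reluctant without an anchor word: stops at the FIRST "]]".
    static String reluctantBare(String s) {
        return s.replaceAll("\\[\\[Image:.*?\\]\\]", "");
    }

    public static void main(String[] args) {
        System.out.println("<" + greedy(SAMPLE) + ">");            // prints "<>"
        System.out.println("<" + reluctantAnchored(SAMPLE) + ">"); // prints "< In 1907, [[Errico Malatesta]]>"
        System.out.println("<" + reluctantBare(SAMPLE) + ">");     // prints "< observances]] In 1907, [[Errico Malatesta]]>"
    }
}
```

The last case shows why a bare reluctant quantifier is not enough on its own: it ends the match at the closing brackets of the first nested link rather than at the end of the image block.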
aleph_null
  • Unfortunately, not every closing bracket will have the word "observances" occur just before it. – Dan Q Nov 02 '11 at 03:18
  • Oh, I misunderstood. So you want to remove everything from [[Image up to its respective closing ]]? That's impossible with a regex, since you have an unknown number of nested [[ ]] within. If you want a legit solution, you'll have to code it up yourself. If you use a stack, the algorithm is pretty straightforward: push a "[[" every time you encounter a [[, and pop every time you encounter a ]]. When the stack is empty, you've reached Image:'s closing tag. – aleph_null Nov 02 '11 at 04:07
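The bracket-counting idea from the comment above could be sketched roughly as follows. This is only an illustration, not code from the thread: the class and method names (ImageLinkStripper, stripImageLinks, findClosing) are hypothetical, a plain depth counter stands in for the stack because only one kind of bracket is tracked, and the prefix match is assumed to be case-sensitive.

```java
class ImageLinkStripper {

    // Removes every "[[Image:...]]" block, including any "[[...]]" links nested inside it.
    static String stripImageLinks(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("[[Image:", i)) {
                int end = findClosing(text, i);
                if (end >= 0) {
                    i = end;       // skip the whole balanced block
                    continue;
                }
                // Unbalanced block: fall through and copy it unchanged.
            }
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }

    // Returns the index just past the "]]" that balances the "[[" at start,
    // or -1 if the brackets never balance.
    static int findClosing(String text, int start) {
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("[[", i)) {
                depth++;           // "push"
                i += 2;
            } else if (text.startsWith("]]", i)) {
                depth--;           // "pop"
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "[[Image:a.jpg|by [[Walter Crane]] and [[foo[[baz]]bar]] observances]] In 1907, the [[Congress]] met";
        System.out.println(stripImageLinks(s)); // prints " In 1907, the [[Congress]] met"
    }
}
```

Because the counter simply goes up and down, this handles any nesting depth, which is the case a fixed regex cannot cover.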
0

What about this example?

s.replaceAll("(\\[{2}Image:(?:(?:\\[{2}).*\\]{2}|[^\\[])*\\]{2})", "");

It would replace only this text:

  • [[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]
TJR
0

This works:

str.replaceAll("^\\[\\[([^\\[]*?(\\[\\[[^\\]]*\\]\\])?[^\\[]*?)*?\\]\\]\\s*", "");

Output from your input:

In 1907, the [[International...

This works because it's looking for matching pairs of [[ and ]] (and surrounding text) inside the first such pair.

Bohemian
0

Maybe like this:

(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])

I tried it with:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class My {

    public static void main(String[] args) {
        String foo = "[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]";
        // Lazily scan to a [[...]] link that is followed by text containing no '[' and then a closing ]].
        Matcher m = Pattern.compile("(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])").matcher(foo);
        while (m.find()) {
            System.out.print(m.group(1));
        }
    }
}

And it prints

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]

Hope this helps :D

gwokae
0

Using the following test string (note, I added an additional [[image:foobar[[foo [baz] bar]]foobar]] in there):

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of[[image:foobar[[foo [baz] bar]]foobar]] Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

And a regular expression pattern of:

(?i)\\[\\[image:(?:\\[\\[(?:(?!(?:\\[\\[|]])).)*]]|(?:(?!(?:\\[\\[|]])).)*?)*?]]

testString.replaceAll(<above pattern>, "") will return:

 In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

Here's a more detailed explanation of the regular expression:

(?i)                    # Case insensitive flag
\[\[image:              # Match literal characters '[[image:'
(?:                     # Begin non-capturing group
  \[\[                  # Match literal characters '[['
  (?:                   # Begin non-capturing group
    (?!                 # Begin non-capturing negative look-ahead group
      (?:               # Begin non-capturing group
        \[\[            # Match literal characters '[['
        |               # Match previous atom or next atom
        ]]              # Match literal characters ']]'
      )                 # End non-capturing group
    )                   # End non-capturing negative look-ahead group
    .                   # Match any character
  )                     # End non-capturing group
  *                     # Match previous atom zero or more times
  ]]                    # Match literal characters ']]'
  |                     # Match previous atom or next atom
  (?:                   # Begin non-capturing group
    (?!                 # Begin non-capturing negative look-ahead group
      (?:               # Begin non-capturing group
        \[\[            # Match literal characters '[['
        |               # Match previous atom or next atom
        ]]              # Match literal characters ']]'
      )                 # End non-capturing group
    )                   # End non-capturing negative look-ahead group
    .                   # Match any character
  )                     # End non-capturing group
  *?                    # Reluctantly match previous atom zero or more times
)                       # End non-capturing group
*?                      # Reluctantly match previous atom zero or more times
]]                      # Match literal characters ']]'

This handles only one level of nested [[...]] patterns. As noted in an answer to the question that TJR commented about above, regular expressions cannot handle arbitrarily deep nesting. So this pattern will not match something like [[foo[[baz]]bar]] inside an [[image:...]] string.
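A quick way to check the one-level limit is to run the pattern on two tiny inputs, one with a single nested link and one with a link nested two levels deep. The class name OneLevelDemo, the strip helper, and both inputs are made up for this sketch; the pattern itself is the one given above.

```java
class OneLevelDemo {
    // Pattern from this answer, written as a Java string literal.
    static final String IMAGE_PATTERN =
        "(?i)\\[\\[image:(?:\\[\\[(?:(?!(?:\\[\\[|]])).)*]]|(?:(?!(?:\\[\\[|]])).)*?)*?]]";

    static String strip(String text) {
        return text.replaceAll(IMAGE_PATTERN, "");
    }

    public static void main(String[] args) {
        // One level of nesting: the whole image block is removed.
        System.out.println("<" + strip("[[Image:a[[foo]]b]] tail") + ">");
        // prints "< tail>"

        // Two levels of nesting: the pattern finds no match, so the input is left unchanged.
        System.out.println("<" + strip("[[Image:a[[foo[[baz]]bar]]b]] tail") + ">");
        // prints "<[[Image:a[[foo[[baz]]bar]]b]] tail>"
    }
}
```

If deeper nesting can occur in the data, counting brackets in a hand-written loop is probably the safer route.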

For a great regular expressions reference, see Regular-Expressions.info.

Community
Go Dan