Matching content within a large text using regex in Java

Question

Problem: I need to match content within a large text (a Wikipedia dump consisting of XML pages) in Java.
Content required: Infobox
Regex used: "\\{\\{Infobox(.*?)\\}\\}"

Issue: the pattern above stops at the first occurrence of }} inside the infobox, and if I remove the ? character from the regex, it matches up to the last occurrence of }} in the text. I want to extract just the infobox, so the closing }} should be the one that ends the infobox.

Example infobox:

{{infobox RPG
|title= Amber Diceless Roleplaying Game
|image= [[Image:Amber DRPG.jpg|200px]]
|caption= Cover of the main ''Amber DRPG'' rulebook (art by [[Stephen Hickman]])
|designer= [[Erick Wujcik]]
|publisher= [[Phage Press]]&lt;br&gt;[[Guardians of Order]]
|date= 1991
|genre= [[Fantasy]]
|system= Custom (direct comparison of statistics without dice)
|footnotes= 
}}

Code snippet:

String regex = "\\{\\{Infobox(.*?)\\}\\}";
Pattern p1 = Pattern.compile(regex, Pattern.DOTALL);
Matcher m1 = p1.matcher(xmlPage.getText());
String workgroup = "";
while(m1.find()){
    workgroup = m1.group();
}

I'm fairly sure this can't be done with a Java regular expression. I think you'll have to read the file and parse it the hard way. — Dawood ibn Kareem, Nov 07 '13 at 18:23
I am using a SAX parser to read the content of the XML and storing the infoboxes in a user-defined object. — Vinoth, Nov 07 '13 at 18:24
The issue here is that if there is a }} within the infobox, the pattern matches up to that }} rather than the entire infobox, which is what we want. — Vinoth, Nov 07 '13 at 18:27
The infobox format is {{Infobox ........}}, and that is what I need to extract. However, if there is an infobox like {{Infobox ...{{..}}..}}, my pattern matches only up to {{Infobox ...{{..}}. — Vinoth, Nov 07 '13 at 18:30
Oh, so an `infobox` block can also contain nested blocks? Your example does not show that. — anubhava, Nov 07 '13 at 18:32

Dragos · Answer 1 · 2013-11-07T21:38:49.667

1

You should try this regex:

String regex = "\\{\\{[Ii]nfobox([^\\}].*\\n+)*\\}\\}";

or

Pattern pattern = Pattern.compile("\\{\\{[Ii]nfobox([^\\}].*\\n+)*\\}\\}");

Explanation: the regex above looks for
1. \\{\\{ - matches the opening {{
2. [Ii]nfobox - matches Infobox or infobox
3. ([^\\}].*\\n+)* - matches the body of the infobox line by line (each line must not start with } and may contain any characters, any number of times)
----3.a. [^\\}] - matches any single character except }
----3.b. .* - matches any character except a newline, any number of times
----3.c. \\n+ - matches a newline 1 or more times
4. \\}\\} - matches the closing }}

edited Nov 07 '13 at 21:38

answered Nov 07 '13 at 18:34


Can you please explain this syntax? – Vinoth Nov 07 '13 at 18:52
The above syntax works, but the code never finishes executing. It hangs after printing some infoboxes. – Vinoth Nov 07 '13 at 18:57
This doesn't solve the problem. It's equivalent to the regexp in the "Code snippet" section of the original question, but less performant. The problem is what to do about `{{ ... }}` blocks nested inside the Infobox. – Dawood ibn Kareem Nov 07 '13 at 19:19
regex = "[iI]nfobox([^\}\}].*\n+)*" to get blocks nested inside the Infobox – Dragos Nov 07 '13 at 19:25
2

"`[^\\}\\}]` - matches everything except }}" is incorrect. It will match all characters except }. If you want to match everything except }}, use `(?!\}\})`, or in Java `(?!\\}\\})` – srbs Nov 07 '13 at 19:26

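A small sketch of the point made in the last comment above: inside a character class a repeated } changes nothing, so [^\\}\\}] behaves like [^\\}], while a negative lookahead is what excludes the two-character sequence }}. The class name LookaheadDemo, the method firstBody and the sample strings are illustrative assumptions, not part of the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Consumes one character at a time, but only while the next two
    // characters are not "}}" (DOTALL lets '.' cross line breaks).
    static final Pattern TEMPERED = Pattern.compile(
            "\\{\\{[Ii]nfobox((?:(?!\\}\\}).)*)\\}\\}", Pattern.DOTALL);

    // Returns the body of the first match, or null if nothing matches.
    static String firstBody(String text) {
        Matcher m = TEMPERED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A character class lists single characters, so the duplicate } is redundant.
        System.out.println(Pattern.matches("[^\\}\\}]", "}")); // false
        System.out.println(Pattern.matches("[^\\}\\}]", "a")); // true
        // A lone } is accepted in the body; only the pair }} ends it.
        System.out.println("[" + firstBody("{{infobox a } b }} rest") + "]");
    }
}
```

Note that this pattern still ends at the first }}, so, like the lazy .*? in the question, it does not cope with nested {{ ... }} blocks.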
score 1 · Answer 2 · answered Nov 07 '13 at 19:13

The solution depends on the nesting depth of {{ .. }} blocks inside the infobox block. If the inner blocks don't themselves nest, that is, there are {{ ... }} blocks but NOT {{ .. {{ .. }} .. }} blocks, then you can try the regex: infobox([^\\{]*(\\{\\{[^\\}]*\\}\\})*.*?)\\}\\}

I tested this on the string: "A {{ start {{infobox abc {{ efg }} hij }}end }} B" and was able to match " abc {{ efg }} hij "

If the nesting of {{ .. }} blocks is deeper, then a regex won't help because you can't tell the regex engine how big the inner block is. Instead, you need to count the opening {{ and closing }} sequences and extract the string that way. That means you would be better off reading the text one character at a time and processing it.
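The counting approach described above could look roughly like the following. This is a minimal sketch, not a full wikitext parser: the class InfoboxScanner and its method names are hypothetical, it assumes every infobox starts with the literal text {{infobox (in any letter case), and it ignores constructs such as nowiki sections or comments that might contain unbalanced braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InfoboxScanner {
    // The regex only locates where an infobox starts; the end is found by counting.
    private static final Pattern START =
            Pattern.compile("\\{\\{infobox", Pattern.CASE_INSENSITIVE);

    // Returns every top-level infobox, including its outer {{ and }}.
    static List<String> extractInfoboxes(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = START.matcher(text);
        int from = 0;
        while (from < text.length() && m.find(from)) {
            int end = findClosing(text, m.start());
            if (end < 0) {
                break; // braces never balance from this start; give up
            }
            result.add(text.substring(m.start(), end));
            from = end; // continue searching after this infobox
        }
        return result;
    }

    // Returns the index just past the }} that balances the {{ at 'start', or -1.
    private static int findClosing(String text, int start) {
        int depth = 0;
        int i = start;
        while (i < text.length() - 1) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("}}", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "A {{infobox abc {{ efg {{ deep }} }} hij }} B {{Infobox x}}";
        for (String box : extractInfoboxes(text)) {
            System.out.println(box);
        }
    }
}
```

Because the depth counter is an ordinary integer, the nesting depth is not limited, and the text is scanned once from each infobox start.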

Explanation of regex:

We start with infobox and then open the group capture parenthesis. We then look for a string of characters which are NOT {.

Following that we look for zero or more "groups" of the form {{ .. }} (BUT with no nested blocks therein). Nesting is not allowed here because we use [^\\}] to find the end of the block, which allows only non-} characters inside the block.

Finally we accept the characters just prior to the closing }}.
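A quick way to check the pattern against the test string from this answer is sketched below. It also tries an input with two inner blocks separated by text, where the trailing lazy .*? runs on to the first }} it meets, so the capture ends inside the second inner block. The pattern named VARIANT is my own assumption rather than part of the answer: it repeats "inner block, then brace-free text" in a loop and still supports only one level of nesting. The class name, the method body and the use of Pattern.DOTALL (taken from the question's snippet) are likewise assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OneLevelDemo {
    // The regex from the answer, compiled with DOTALL so .*? can cross line breaks.
    static final Pattern ORIGINAL = Pattern.compile(
            "infobox([^\\{]*(\\{\\{[^\\}]*\\}\\})*.*?)\\}\\}", Pattern.DOTALL);

    // Assumed variant: brace-free text, then any number of
    // "inner {{..}} block followed by brace-free text".
    static final Pattern VARIANT = Pattern.compile(
            "infobox([^\\{\\}]*(?:\\{\\{[^\\{\\}]*\\}\\}[^\\{\\}]*)*)\\}\\}");

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String body(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String fromAnswer = "A {{ start {{infobox abc {{ efg }} hij }}end }} B";
        String twoInner = "{{infobox a {{x}} b {{y}} c }}";
        System.out.println("[" + body(ORIGINAL, fromAnswer) + "]");
        System.out.println("[" + body(ORIGINAL, twoInner) + "]");
        System.out.println("[" + body(VARIANT, twoInner) + "]");
    }
}
```

The variant rejects single { or } characters inside the body, so real wikitext containing stray braces would need the character-by-character approach instead.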

If you know the maximum nesting depth of `{{ .. }}` blocks in the text you could create a bloated regex to match correctly. — Abid H. Mujtaba, Nov 07 '13 at 19:16
+1. This is a good 99% solution - it will deal with the cases with just one level of nesting just fine. It also mentions that reading and processing the text without using a regexp is the best thing to do if there's likely to be more than one level of nesting. This makes it the best possible answer. As per my comment under the question, I don't believe there's a regexp that will solve the problem with arbitrarily many levels of nesting inside the infoboxes. — Dawood ibn Kareem, Nov 07 '13 at 19:22

score 0 · Answer 3 · edited May 23 '17 at 12:29

If your xmlPage.getText() returns content similar to this:

{{infobox ... }}{{infobox .... {{ nested stuff }} }}{{infobox ... }}

where you have both multiple infoboxes on the same level and nested content (and the nesting level can be anything), then you can't use a regexp to parse the content. Why? Because the structure behaves much like HTML or XML, and so it is not a regular structure. You can find multiple answers on the topic "regexp and html" that explain this problem well. For example: Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

But if you can guarantee that you won't have multiple infoboxes on the same level, only nested ones, then you can parse the document by removing the '?' from the pattern.
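To make the difference concrete, here is a small sketch comparing the lazy pattern from the question with the same pattern without the ?. The class GreedyLazyDemo, the method first and the sample strings are assumptions for illustration. It suggests that the greedy form extends to the last }} in the whole text, so it only isolates the infobox when no other {{ ... }} block follows it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLazyDemo {
    static final String LAZY = "\\{\\{Infobox(.*?)\\}\\}";
    static final String GREEDY = "\\{\\{Infobox(.*)\\}\\}";

    // Returns the full text of the first match, or null if nothing matches.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String nested = "{{Infobox a {{b}} c}} tail";
        String siblings = "{{Infobox a}} x {{Infobox b}}";
        System.out.println(first(LAZY, nested));     // stops at the first }}
        System.out.println(first(GREEDY, nested));   // runs to the last }}
        System.out.println(first(GREEDY, siblings)); // swallows both infoboxes
    }
}
```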

score 0 · Answer 4 · answered Dec 11 '18 at 04:39

public static void extractValuesTest(String[] args) {
        String payloadformatstr= "selected card is |api:card_number| with |api:title|";
        String receivedInputString= "siddiselected card is 1234567 with dbs card";
        int firstIndex = payloadformatstr.indexOf("|");
        // Slot names found between the | delimiters, e.g. "api:card_number"
        List<String> slotSplits= extarctString(payloadformatstr, "\\|(.*?)\\|");
        // Literal text surrounding the slots
        String[] mainSplits = payloadformatstr.split("\\|(.*?)\\|");
        int mainsplitLength = mainSplits.length;
        int slotNumber=0;
        Map<String,String> parsedValues = new HashMap<>();
        String replaceString="";
        int receivedstringLength = receivedInputString.length();
        for (String slot : slotSplits) {
            // A slot has the form type:key; only the key is used below
            String[] slotArray = slot.split(":");
            String slotKey = (slotArray.length == 2) ? slotArray[1] : null;
            String slotBefore= (firstIndex != 0 && slotNumber < mainsplitLength) ? mainSplits[slotNumber]:null;
            String slotAfter= (firstIndex != 0 && slotNumber+1 < mainsplitLength) ? mainSplits[slotNumber+1]:null;
            // The value sits between the literal text before and after the slot;
            // a missing neighbour means "start of input" or "end of input"
            int startIndex = StringUtils.isEmpty(slotBefore) ?  0:receivedInputString.indexOf(slotBefore)+slotBefore.length();
            int endIndex =  StringUtils.isEmpty(slotAfter) ? receivedstringLength: receivedInputString.indexOf(slotAfter);
            String extractedValue = receivedInputString.substring(startIndex, endIndex);
            System.out.println("Extracted value is "+extractedValue);
            parsedValues.put(slotKey, extractedValue);
            // Avoid appending the text "null" when there is no preceding literal
            replaceString+=(slotBefore != null ? slotBefore:"")+extractedValue;
            slotNumber++;
        }
        System.out.println(replaceString);
        System.out.println(parsedValues);
    }

    public static void replaceTheslotsWithValues(String payloadformatstr,String receivedInputString,String slotPattern,String statPatternOfSlot) {
        // The arguments are overwritten with sample values for demonstration
        payloadformatstr= "selected card is |api:card_number| with |api:title|.";
        receivedInputString= "selected card is 1234567 with dbs card.";
        slotPattern="\\|(.*?)\\|";
        statPatternOfSlot="|";
        int firstIndex = payloadformatstr.indexOf(statPatternOfSlot);
        // Slot names found between the delimiters, e.g. "api:card_number"
        List<String> slotSplits= extarctString(payloadformatstr, slotPattern);
        // Literal text surrounding the slots
        String[] mainSplits = payloadformatstr.split(slotPattern);
        int mainsplitLength = mainSplits.length;
        int slotNumber=0;
        Map<String,String> parsedValues = new HashMap<>();
        String replaceString="";
        for (String slot : slotSplits) {
            // A slot has the form type:key; only the key is used below
            String[] slotArray = slot.split(":");
            String slotKey = (slotArray.length == 2) ? slotArray[1] : null;
            String slotBefore= (firstIndex != 0 && slotNumber < mainsplitLength) ? mainSplits[slotNumber]:"";
            String slotAfter= (firstIndex != 0 && slotNumber+1 < mainsplitLength) ? mainSplits[slotNumber+1]:"";
            // This version assumes literal text follows every slot (here the final ".")
            int startIndex = receivedInputString.indexOf(slotBefore)+slotBefore.length();
            int endIndex = receivedInputString.indexOf(slotAfter);
            String extractedValue = receivedInputString.substring(startIndex, endIndex);
            System.out.println("Extracted value is "+extractedValue);
            parsedValues.put(slotKey, extractedValue);
            replaceString+=slotBefore+extractedValue;
            slotNumber++;
        }
        System.out.println(replaceString);
        System.out.println(parsedValues);
    }
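Both methods above call a helper named extarctString that is not shown in the answer. A minimal sketch of what it might look like follows, assuming it returns the first capture group of every match; that behavior is a guess based on how the result is used, and the wrapper class SlotHelper is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlotHelper {
    // Assumed behavior: collect capture group 1 of each match of 'regex' in 'input'.
    // The name keeps the spelling used by the calling code.
    static List<String> extarctString(String input, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        String format = "selected card is |api:card_number| with |api:title|";
        System.out.println(extarctString(format, "\\|(.*?)\\|"));
    }
}
```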
