A regex to return text from a parsed Word document

Question

I am trying to create a regex to match a portion of text in my Word document. The document contains something like {LigneDetails.Libelle}, and when I process the file with Java, the generated XML looks like this:

<w:t>{</w:t>
         </w:r>
         <w:proofErr w:type="spellStart" />
         <w:r w:rsidRPr="009664EA">
            <w:t>SOCIETE.RaisonSociale</w:t>
         </w:r>
         <w:proofErr w:type="spellEnd" />
         <w:r w:rsidRPr="009664EA">
 <w:t>}</w:t>

So here I match the text between the curly braces using this regex: \\{([^\\{])*\\}, which returns:

{</w:t>
         </w:r>
         <w:proofErr w:type="spellStart" />
         <w:r w:rsidRPr="009664EA">
            <w:t>SOCIETE.RaisonSociale</w:t>
         </w:r>
         <w:proofErr w:type="spellEnd" />
         <w:r w:rsidRPr="009664EA">
            <w:t>}

Now my Word document contains something like this: {LigneDetails.Libelle:FAM:01}

This generates:

<w:t>{</w:t>
    </w:r>
    <w:proofErr w:type="spellStart" />
    <w:r w:rsidRPr="002A51DD">
       <w:rPr>
          <w:sz w:val="14" />
          <w:szCs w:val="20" />
       </w:rPr>
       <w:t>LigneDetails.Libelle:FAM</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r w:rsidRPr="002A51DD">
       <w:rPr>
          <w:sz w:val="14" />
          <w:szCs w:val="20" />
       </w:rPr>
       <w:t>:01}</w:t>

The regex then matches this portion:

{</w:t>
                  </w:r>
                  <w:proofErr w:type="spellStart" />
                  <w:r w:rsidRPr="002A51DD">
                     <w:rPr>
                        <w:sz w:val="14" />
                        <w:szCs w:val="20" />
                     </w:rPr>
                     <w:t>LigneDetails.Quantite:FAM</w:t>
                  </w:r>
                  <w:proofErr w:type="spellEnd" />
                  <w:r w:rsidRPr="002A51DD">
                     <w:rPr>
                        <w:sz w:val="14" />
                        <w:szCs w:val="20" />
                     </w:rPr>
                     <w:t>:01}

So far everything works.

Now I want to match the last two values, which always come after a :. In my case that would be FAM and 01, so I want the regex to return these two values.

How can I do that?

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — teukkam, Sep 23 '16 at 09:58
Isn't your regex wrong? You posted `\{([^\{])*\}` (removed the double escapes that Java requires) but shouldn't you be using `\{([^\}])*\}` instead, i.e. matching anything not a _right curly brace_? — Thomas, Sep 23 '16 at 09:58
The problem with your requirements is that getting the _right_ colons (`:`) _reliably_ is very hard (see teukkam's comment which probably tries to get at the point that parsing XML/HTML with regex is a hard task and bound to fail if the structure of the markup is beyond your control - XML/HTML are not regular languages so regular expressions don't fit very well). — Thomas, Sep 23 '16 at 10:02
@Thomas my regex is actually working and returning what I want — Renaud is Not Bill Gates, Sep 23 '16 at 10:11
@Thomas so what if I used `\{([^\}])*\}` to get the text between the two curly braces and then created a second regex that matches my requirements on the returned string? I noticed that the text I want is always between `<w:t>` and `</w:t>`, so my regex would search the returned string above for the text between `<w:t>` and `</w:t>` (in my case this would return `{`, `LigneDetails.Quantite:FAM` and `:01}`), and then look for the text that comes after each `:`, which would return `FAM` and `01`. Does this make any sense? — Renaud is Not Bill Gates, Sep 23 '16 at 10:29
That _might_ work but as long as you can't verify the structure _always_ looks like this it still might break. — Thomas, Sep 23 '16 at 10:36
@Thomas so what would the regex look like if I wanted to do that? — Renaud is Not Bill Gates, Sep 23 '16 at 10:48
As I said, regex and xml/html are not good fit so I can't provide any catch-all expression. You'd probably be better off using a proper parser. — Thomas, Sep 23 '16 at 11:05
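
For reference, the parser route suggested in the comments could look roughly like the sketch below. It is not from the original thread. It assumes the markup is a complete, well-formed XML document whose w prefix is bound to the standard WordprocessingML namespace, and the class and method names (WordTextFields, plainText, placeholders) are made up for illustration. It joins the text of all w:t elements first, so a placeholder that Word split across several runs becomes one plain string before any regex is applied.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class WordTextFields {
    // WordprocessingML namespace that the "w" prefix is normally bound to.
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates the text of every w:t element, in document order. */
    public static String plainText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                sb.append(texts.item(i).getTextContent());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the colon-separated parts of each {...} placeholder in text. */
    public static List<String[]> placeholders(String text) {
        List<String[]> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\{([^{}]*)\\}").matcher(text);
        while (m.find()) {
            result.add(m.group(1).split(":"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical minimal document; a real one has many more elements.
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>{</w:t></w:r>"
            + "<w:r><w:t>LigneDetails.Libelle:FAM</w:t></w:r>"
            + "<w:r><w:t>:01}</w:t></w:r></w:p>";
        for (String[] parts : placeholders(plainText(xml))) {
            System.out.println(String.join(" | ", parts));
        }
        // prints: LigneDetails.Libelle | FAM | 01
    }
}
```

Joining every w:t in the document ignores paragraph boundaries, which is a simplification; collecting the text per paragraph would be safer if placeholders never span paragraphs.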

score 1 · Accepted Answer · answered Sep 23 '16 at 11:45

With your current approach, you end up with {...} strings that contain only <...> tags, text, and the { at the start and } at the end, all of which you can remove with a regex. After that, either go through the lines and split each one on :, or use a regex to grab all the non-whitespace characters after each : symbol.

Sample Java code:

String str = "{</w:t>\n                  </w:r>\n                  <w:proofErr w:type=\"spellStart\" />\n                  <w:r w:rsidRPr=\"002A51DD\">\n                     <w:rPr>\n                        <w:sz w:val=\"14\" />\n                        <w:szCs w:val=\"20\" />\n                     </w:rPr>\n                     <w:t>LigneDetails.Quantite:FAM</w:t>\n                  </w:r>\n                  <w:proofErr w:type=\"spellEnd\" />\n                  <w:r w:rsidRPr=\"002A51DD\">\n                     <w:rPr>\n                        <w:sz w:val=\"14\" />\n                        <w:szCs w:val=\"20\" />\n                     </w:rPr>\n                     <w:t>:01}"; 
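// Remove every <...> tag, plus the leading { and the trailing }
str = str.replaceAll("<[^<]*?>|^\\{|\\}$", "");
String[] lines = str.split("\n");
List<String> lst = new ArrayList<>(); // needs java.util.List and java.util.ArrayList
for (String s : lines) {
    // Only the lines that still hold text contain a colon
    if (s.contains(":"))
        lst.add(s.trim().split(":")[1]); // keep the part after the colon
}
System.out.println(lst); // [FAM, 01]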

See the Java demo

Or a version that uses the regex :(\S+) to capture each run of one or more non-whitespace characters that follows a colon in the stripped string:

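// Remove every <...> tag, plus the leading { and the trailing }
str = str.replaceAll("<[^<]*?>|^\\{|\\}$", "");
// Needs java.util.regex.Matcher and java.util.regex.Pattern
Matcher m = Pattern.compile(":(\\S+)").matcher(str);
List<String> lst = new ArrayList<>();
while (m.find()) {
    lst.add(m.group(1)); // group 1 is the text after the colon
}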

See another demo
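
For convenience, here is a self-contained sketch that chains the steps from the question and this answer: find the {...} span, strip the tags and braces, then collect the values after each colon. The class and method names (PlaceholderArgs, extract) are hypothetical. It also departs from the snippets above in two ways, both of which are my assumptions rather than anything the thread confirmed: the placeholder pattern excludes } (as suggested in the comments), and the value pattern is :([^\s:]+) instead of :(\S+), so that FAM and 01 still come out separately when no whitespace is left between the stripped runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderArgs {
    // One {...} placeholder, including any markup between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{[^}]*\\}");
    // A colon followed by a run of characters that are neither whitespace nor colons.
    private static final Pattern AFTER_COLON = Pattern.compile(":([^\\s:]+)");

    /** Returns the values that follow each colon in the first placeholder of xml. */
    public static List<String> extract(String xml) {
        List<String> values = new ArrayList<>();
        Matcher placeholder = PLACEHOLDER.matcher(xml);
        if (!placeholder.find()) {
            return values; // no placeholder found
        }
        // Remove every <...> tag, plus the leading { and the trailing }.
        String text = placeholder.group().replaceAll("<[^<]*?>|^\\{|\\}$", "");
        Matcher m = AFTER_COLON.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question.
        String xml = "<w:t>{</w:t></w:r><w:r w:rsidRPr=\"002A51DD\">"
            + "<w:t>LigneDetails.Libelle:FAM</w:t></w:r>"
            + "<w:r w:rsidRPr=\"002A51DD\"><w:t>:01}</w:t>";
        System.out.println(extract(xml)); // [FAM, 01]
    }
}
```

This still relies on the markup between the braces never containing a literal }, so it is only as reliable as that assumption.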
