
I want to do replacements in an MS Word (.docx) document using regular expressions (Java regex):

Example: 
 …, с одной стороны, и %SOME_TEXT% именуемое в дальнейшем «Заказчик», в 
 лице  %SOME_TEXT%   действующего на основании %SOME_TEXT% с другой стороны, 
 заключили настоящий Договор о нижеследующем: …

I tried to find the text templates (like %SOME_TEXT%) with Apache POI (XWPF) and replace them, but the replacement is not guaranteed to work because POI splits the text into separate runs. Printing each run with System.out.println(run.getText(0)) gives something like this:

…
, с одной стороны, и 
%
SOME_TEXT
%

именуемое 
в дальнейшем «Заказчик», в лице

%
SOME
_
TEXT
%

Code example:

FileInputStream fis = new FileInputStream(new File("document.docx"));
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
paragraphs.forEach(para -> {
    para.getRuns().forEach(run -> {
        String text = run.getText(0);
        if (text != null) {
           System.out.println(text);
           // text replacement process
           // run.setText(newText,0);
        }
    });
});

I have found many similar questions (such as "Replacing a text in Apache POI XWPF"), but none of them answers my problem (the answer to "Seperated text line in Apache POI XWPFRun object" offers an inconvenient solution).

I also tried docx4j with the example from "docx4j find and replace", but docx4j behaves similarly.

For docx4j, see stackoverflow.com/questions/17093781/… – JasonPlutext

I tried to use docx4j => documentPart.variableReplace(mappings);, but the replacement is not guaranteed (plutext/docx4j).

Did you use VariablePrepare? stackoverflow.com/a/17143488/1031689 – JasonPlutext

Yes, no results:

WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File("test.docx"));
HashMap<String, String> mappings = new HashMap<>();
VariablePrepare.prepare(wordMLPackage);//see notes
mappings.put("SOME_TEXT", "XXXX");
wordMLPackage.getMainDocumentPart().variableReplace(mappings);
wordMLPackage.save(new File("out.docx"));

Input/output text:

Input:
…, с одной стороны, и ${SOME_TEXT} именуемое в дальнейшем «Заказчик» ...
Output:
…, с одной стороны, и SOME_TEXT именуемое в дальнейшем «Заказчик» ...

To see your runs after VariablePrepare, turn on INFO level logging for VariablePrepare, or just System.out.println(wordMLPackage.getMainDocumentPart().getXML())

I understand that the templates were split into different runs, but the main question of this topic is how to keep a template from being split into different runs. I used System.out.println(wordMLPackage.getMainDocumentPart().getXML()) and saw:

<w:r>
   <w:t xml:space="preserve">, с одной стороны, и </w:t>
</w:r>
<w:r><w:t>$</w:t></w:r>
<w:r><w:t>{</w:t></w:r>
<w:r>
    <w:rPr>
       <w:rFonts w:eastAsia="Times-Roman"/>
          <w:color w:val="000000" w:themeColor="text1"/>
          <w:lang w:val="en-US"/>
    </w:rPr>
    <w:t>SOME</w:t>        <!-- First part of template: "SOME" -->
</w:r>
<w:r>
    <w:rPr>
        <w:rFonts w:eastAsia="Times-Roman"/>
        <w:color w:val="000000" w:themeColor="text1"/>
    </w:rPr>
    <w:t>_</w:t>           <!-- Second part of template: "_"   -->
</w:r>
<w:r>
    <w:rPr>
        <w:rFonts w:eastAsia="Times-Roman"/>
        <w:color w:val="000000" w:themeColor="text1"/>
        <w:lang w:val="en-US"/>
    </w:rPr>
    <w:t>TEXT</w:t>        <!-- Third part of template: "TEXT" -->
</w:r>
<w:r>
    <w:rPr>
        <w:rFonts w:eastAsia="Times-Roman"/>
        <w:color w:val="000000" w:themeColor="text1"/>
    </w:rPr>
    <w:t>}</w:t>
</w:r>

So the template is spread across different XML tags, and I do not understand WHY...

Please help me find a convenient approach to replacing the text.

kozmo
  • For docx4j, see https://stackoverflow.com/questions/17093781/docx4j-does-not-replace-variables – JasonPlutext Apr 05 '18 at 21:14
  • I tried to use **docx4j** => `documentPart.variableReplace(mappings);`, but the replacement is not guaranteed (see question updates). – kozmo Apr 06 '18 at 07:02
  • Did you use VariablePrepare? https://stackoverflow.com/a/17143488/1031689 – JasonPlutext Apr 06 '18 at 10:28
  • First, POI doesn't separate the runs. If the text is all in a single run, then POI will find it that way. If the text is broken up into multiple runs (lots of things you can do in Microsoft Word will cause this), then any solution that retrieves text by run will suffer from the same issue. You have to consolidate the text from the runs into a single string before comparing. – jmarkmurphy Apr 06 '18 at 13:13
  • Thx, @jmarkmurphy, I know, but I would like to find a way to simply change text in a .docx/.doc file without being afraid of the template being split across runs. – kozmo Apr 06 '18 at 14:49
  • @JasonPlutext - I tried VariablePrepare => look at updates. – kozmo Apr 06 '18 at 20:43
  • As you see, the approach "to do replacements in MS Word (.docx) document using regular expression (java RegEx)" is not really good since you never can be sure that the text to replace will be together in one text-run. Better approach is using fields (merge fields or form fields) or content controls in Word. – Axel Richter Apr 07 '18 at 05:20
  • @Axel Richter - add example, please. – kozmo Apr 07 '18 at 05:39
  • To see your runs after VariablePrepare, turn on INFO level logging for VariablePrepare, or just System.out.println(wordMLPackage.getMainDocumentPart().getXML()) – JasonPlutext Apr 07 '18 at 07:53
  • @JasonPlutext - thx, but I want to put text (a *template*) from another document into `${}` (or something like this), and then use any approach to change this *template* to the actual `String` value (and keep the *properties* of the text). Now MS Word splits my *templates* into different **Runs**, and the properties of these runs are the same. I am trying to fight it.... – kozmo Apr 07 '18 at 08:42
  • Docx4j didn't combine your 3 runs since they have different rPr values (the middle one lacks w:lang). If I'm understanding the requirement you just articulated ("text(template)"), you want to insert an arbitrary chunk of OpenXML into your document in place of the variable? – JasonPlutext Apr 07 '18 at 11:47
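
The point about differing rPr values in the last comment can be illustrated with a simplified model. This is only a sketch in plain Java and not docx4j's actual implementation: the `Run` class and the `props` string that summarises a run's formatting are hypothetical stand-ins for `<w:r>` and `<w:rPr>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RunMergeSketch {

    // Hypothetical stand-in for a <w:r>: its text plus a string summarising its <w:rPr>.
    static final class Run {
        final String text;
        final String props;

        Run(String text, String props) {
            this.text = text;
            this.props = props;
        }
    }

    // Joins neighbouring runs only when their formatting summaries are equal.
    static List<Run> mergeAdjacent(List<Run> runs) {
        List<Run> merged = new ArrayList<>();
        for (Run run : runs) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).props.equals(run.props)) {
                Run previous = merged.get(last);
                merged.set(last, new Run(previous.text + run.text, run.props));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Mirrors the XML in the question: the "_" run has no w:lang.
        List<Run> runs = Arrays.asList(
                new Run("SOME", "color=000000;lang=en-US"),
                new Run("_", "color=000000"),
                new Run("TEXT", "color=000000;lang=en-US"));
        System.out.println(mergeAdjacent(runs).size());
    }
}
```

With the formatting from the question the three runs stay separate, so `main` prints 3. If all three runs carried the same summary, they would collapse into a single run with the text `SOME_TEXT`.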

1 Answer


As you see, the approach "to do replacements in MS Word (.docx) document using regular expression (java RegEx)" is not really good, since you can never be sure that the text to replace will be together in one text run. A better approach is to use fields (merge fields or form fields) or content controls in Word.
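
To make the problem concrete, here is a minimal sketch in plain Java, without POI. The class `RunSplitDemo` and its methods are hypothetical names for illustration, and the list of strings stands in for the text of the runs of one paragraph, split the way the question shows.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RunSplitDemo {

    // Placeholder syntax from the question: %SOME_TEXT%
    static final Pattern PLACEHOLDER = Pattern.compile("%[A-Z_]+%");

    // Counts matches when the text of each run is searched on its own.
    static int countPerRun(List<String> runTexts) {
        int count = 0;
        for (String text : runTexts) {
            Matcher matcher = PLACEHOLDER.matcher(text);
            while (matcher.find()) {
                count++;
            }
        }
        return count;
    }

    // Counts matches after concatenating the run texts of the paragraph.
    static int countJoined(List<String> runTexts) {
        Matcher matcher = PLACEHOLDER.matcher(String.join("", runTexts));
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList(", с одной стороны, и ", "%", "SOME", "_", "TEXT", "%");
        System.out.println("per run: " + countPerRun(runs));
        System.out.println("joined:  " + countJoined(runs));
    }
}
```

Searching run by run finds 0 placeholders here, while searching the joined text finds 1. Joining only solves the finding part: writing the replacement back still means deciding which run receives the new text and which formatting it keeps.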

My favourites for such requirements are still the good old form fields in Word.

The first advantage is that, even without document protection, Word does not allow parts of the form field content to be formatted differently, so the content cannot be torn apart into different runs (but see note 1). The second advantage is that, because of the gray background, the form fields are clearly visible in the document content. Another advantage is the possibility of applying document protection so that only filling in the form fields is possible, even in Word's GUI. This is really good for protecting such contractual documents from unwanted changes.

(Note 1): At least Word prevents parts of the form field content from being formatted differently and thus torn apart into different runs. Other word-processing software (Writer, for example) may not respect this restriction, though.

So I would have the Word template like so:

[Screenshot: Word template with grey form fields]

The grey fields are the good old form text fields in Word, named Text1, Text2 and Text3. A text field block looks like:

<xml-fragment w:rsidR="00833656" 
  ...
 xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" 
 ... >
  <w:rPr>
    <w:rFonts w:eastAsia="Times-Roman"/>
    <w:color w:themeColor="text1" w:val="000000"/>
    <w:lang w:val="en-US"/>
  </w:rPr>
  <w:fldChar w:fldCharType="begin">
    <w:ffData>
      <w:name w:val="Text1"/>
      <w:enabled w:val="0"/>
      <w:calcOnExit w:val="0"/>
      <w:textInput>
        <w:default w:val="&lt;введите заказчика&gt;"/>
      </w:textInput>
    </w:ffData>
  </w:fldChar>
</xml-fragment>

Then the following code:

import java.io.FileOutputStream;
import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.*;

import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.SimpleValue;
import javax.xml.namespace.QName;

public class WordReplaceTextInFormFields {

 private static void replaceFormFieldText(XWPFDocument document, String ffname, String text) {
  boolean foundformfield = false;
  for (XWPFParagraph paragraph : document.getParagraphs()) {
   for (XWPFRun run : paragraph.getRuns()) {
    XmlCursor cursor = run.getCTR().newCursor();
    cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//w:fldChar/@w:fldCharType");
    while(cursor.hasNextSelection()) {
     cursor.toNextSelection();
     XmlObject obj = cursor.getObject();
     if ("begin".equals(((SimpleValue)obj).getStringValue())) {
      cursor.toParent();
      obj = cursor.getObject();
      // A "begin" fldChar without w:ffData belongs to an ordinary field, not a form field
      XmlObject[] names = obj.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//w:ffData/w:name/@w:val");
      foundformfield = names.length > 0 && ffname.equals(((SimpleValue)names[0]).getStringValue());
     } else if ("end".equals(((SimpleValue)obj).getStringValue())) {
      if (foundformfield) return;
      foundformfield = false;
     }
    }
    if (foundformfield && run.getCTR().getTList().size() > 0) {
     run.getCTR().getTList().get(0).setStringValue(text);
     foundformfield = false;
//System.out.println(run.getCTR());
    }
   }
  }
 }

 public static void main(String[] args) throws Exception {

  // try-with-resources closes the output stream and the document, even if an exception is thrown
  try (XWPFDocument document = new XWPFDocument(new FileInputStream("WordTemplate.docx"));
       FileOutputStream out = new FileOutputStream("WordReplaceTextInFormFields.docx")) {

   replaceFormFieldText(document, "Text1", "Моя Компания");
   replaceFormFieldText(document, "Text2", "Аксель Джоачимович Рихтер");
   replaceFormFieldText(document, "Text3", "Доверенность");

   document.write(out);
  }
 }
}

This code needs the full jar of all of the schemas ooxml-schemas-1.3.jar as mentioned in FAQ-N10025.

Produces:

[Screenshot: the resulting document]
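
The control flow of replaceFormFieldText can be pictured as a small state machine. The following is only a simplified model in plain Java, without POI: the `Item` class and its `kind` strings are hypothetical stand-ins for runs that carry either a `w:fldChar` of the given type or a `w:t` text, and the sketch assumes each run carries just one of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FormFieldWalkSketch {

    // Hypothetical stand-in for one run. kind is "begin", "separate", "end" or "text".
    // For "begin" the value is the form field name; for "text" it is the run text.
    static final class Item {
        final String kind;
        final String value;

        Item(String kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    // Replaces the first text item that follows the "begin" of the named field.
    static List<Item> replace(List<Item> items, String fieldName, String newText) {
        List<Item> result = new ArrayList<>(items);
        boolean inField = false;
        for (int i = 0; i < result.size(); i++) {
            Item item = result.get(i);
            if (item.kind.equals("begin")) {
                inField = fieldName.equals(item.value);
            } else if (item.kind.equals("end")) {
                if (inField) {
                    return result; // the field ended without any text run
                }
            } else if (item.kind.equals("text") && inField) {
                result.set(i, new Item("text", newText));
                inField = false; // only the first text run of the field is replaced
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
                new Item("text", ", с одной стороны, и "),
                new Item("begin", "Text1"),
                new Item("separate", null),
                new Item("text", "<введите заказчика>"),
                new Item("end", null),
                new Item("text", " именуемое в дальнейшем «Заказчик»"));
        for (Item item : replace(items, "Text1", "Моя Компания")) {
            System.out.println(item.kind + ": " + item.value);
        }
    }
}
```

In this model, as in the code above, text outside the named field is never touched, and a field whose content is spread over several text runs would only have its first text run replaced.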

Axel Richter
  • thx, it *helps*, but why does `run.getCTR().getTList()` throw `java.lang.NoClassDefFoundError: org/openxmlformats/schemas/wordprocessingml/x2006/main/impl/CTRImpl$1TList`? I changed it to `run.getCTR().getTArray()`. Another question: how can I change the background color of the *grey field*? – kozmo Apr 07 '18 at 10:41
  • This code needs the full jar of all of the schemas `ooxml-schemas-1.3.jar` as mentioned in [FAQ-N10025](https://poi.apache.org/faq.html#faq-N10025). [CTR.getTArray](http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/ooxml-schemas/1.1/org/openxmlformats/schemas/wordprocessingml/x2006/main/CTR.java#CTR.getTArray%28%29) is deprecated. And "How change background color of grey-field?": Not necessary in my opinion. So no answer to this from me. – Axel Richter Apr 07 '18 at 10:48
  • @Axel Richter - thx for your help. Do you know a convenient approach to replacing the 'grey fields'? – kozmo Apr 07 '18 at 11:15
  • Why do you think replacing the 'gray fields' is necessary? Do you think they will be printed out gray? They will not. – Axel Richter Apr 07 '18 at 13:07
  • But if the gray is too ugly for you, see https://wordribbon.tips.net/T006107_Controlling_Field_Shading.html – Axel Richter Apr 07 '18 at 13:18
  • @AxelRichter: I know these types of comments are not really advised here on SO. But I just wanted to say, thank you for being such an MVP. I've had a lot of help from your in-depth POI-related answers over the years. – Priidu Neemre Mar 09 '22 at 07:52