13

I'm trying to fill in a template DOCX document with Apache POI using the XWPFDocument class. I have tags in the document and a JSON file that holds the replacement data. My problem is that a line of text is sometimes split up inside the DOCX: when I change the file's extension to ZIP and open document.xml, the text [MEMBER_CONTACT_INFO] appears as [MEMBER_CONTACT_INFO and ] separately. POI reads it the same way, since that is how the original DOCX stores it. The paragraph therefore contains 2 XWPFRun objects whose texts are [MEMBER_CONTACT_INFO and ].

My question: is there a way to force POI to behave like Word, by merging related runs or something like that? Or how else can I solve this problem? I match against each run's text while replacing, and I can't find my tag because it is split across 2 different run objects.

Best

Johnny000
zaferaltun

5 Answers

19

This wasted so much of my time once...

Basically, an XWPFParagraph is composed of multiple XWPFRuns, and an XWPFRun is a contiguous piece of text with one fixed style.

So when you type something like "[PLACEHOLDER_NAME]" in MS Word, it creates a single XWPFRun. But if you add a few more things and then go back and change "[PLACEHOLDER_NAME]" to something else, it is not guaranteed to remain a single XWPFRun; it may well be split into two runs. AFAIK this is how MS Word works.
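To see why a per-run search misses such a tag, here is a minimal sketch that models the run texts as plain strings. The class and method names are made up for illustration; with POI the strings would come from XWPFRun.getText(0).

```java
import java.util.List;

// Hypothetical model: each String stands in for the text of one XWPFRun.
final class SplitRunDemo {

    // Mirrors a loop that checks each run's text on its own.
    static boolean foundInSingleRun(List<String> runTexts, String tag) {
        for (String text : runTexts) {
            if (text.contains(tag)) {
                return true;
            }
        }
        return false;
    }

    // Checks the paragraph as a whole by joining all run texts first.
    static boolean foundInParagraph(List<String> runTexts, String tag) {
        return String.join("", runTexts).contains(tag);
    }
}
```

With the run texts "[MEMBER_CONTACT_INFO" and "]", the first check fails and the second succeeds, which is exactly the situation described in the question.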

How to avoid splitting of Runs in such cases?

There are two solutions that I know of:

  1. Copy the text "[PLACEHOLDER_NAME]" to Notepad or a similar plain-text editor. Make the necessary modification there, then copy it back and paste it over "[PLACEHOLDER_NAME]" in your Word file. This way the whole placeholder is replaced with the new text in one go, which avoids splitting it across XWPFRuns.

  2. Select "[PLACEHOLDER_NAME]", click MS Word's "Replace" option, and replace it with "[Your-new-edited-placeholder]". This will guarantee that your new placeholder ends up in a single XWPFRun.

If you have to change the placeholder again, repeat solution 1 or 2.
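To check whether a placeholder really ended up in one piece after editing, one option is to look at word/document.xml directly, as the question does by hand. The following is a rough sketch using only the JDK; the class and method names are hypothetical, and the regular expression is a simplification that lists the raw content of each <w:t> element without decoding XML entities.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for inspecting how Word stored the text of a template.
final class DocxTextElements {

    // Matches <w:t>...</w:t> and <w:t xml:space="preserve">...</w:t>,
    // but not other elements that start with "w:t" such as <w:tab/> or <w:tbl>.
    private static final Pattern W_T =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    // Returns the raw content of every <w:t> element, in document order.
    static List<String> extractTextElements(String documentXml) {
        List<String> texts = new ArrayList<>();
        Matcher m = W_T.matcher(documentXml);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    // Reads word/document.xml out of a .docx file, which is a ZIP archive.
    static String readDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

If a placeholder shows up as two or more consecutive entries in the returned list, Word has split it and it needs to be re-entered using one of the two solutions above.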

user2009750
2

Here is Java code that fixes the split text issue. It also handles replacing a string that spans multiple formats.

// Needs: import java.util.ArrayList; import java.util.List; import org.apache.poi.xwpf.usermodel.*;
public static void replaceString(XWPFDocument doc, String search, String replace) {
  // Body paragraphs: the search text may sit in one run or be split across consecutive runs.
  for (XWPFParagraph p : doc.getParagraphs()) {
    List<XWPFRun> runs = p.getRuns();
    if (runs == null) {
      continue;
    }
    List<Integer> group = new ArrayList<Integer>(); // indexes of runs that together spell out search
    String groupText = search;                      // part of search that is still unmatched
    for (int i = 0; i < runs.size(); i++) {
      String text = runs.get(i).getText(0);
      if (text == null) {
        continue;
      }
      if (text.contains(search)) {
        // String.replace is literal, so neither argument needs regex escaping.
        runs.get(i).setText(text.replace(search, replace), 0);
        group.clear();
        groupText = search;
      } else if (groupText.startsWith(text)) {
        group.add(i);
        groupText = groupText.substring(text.length());
        if (groupText.isEmpty()) {
          // The group spells out search: keep its first run and drop the others.
          runs.get(group.get(0)).setText(replace, 0);
          // Remove back to front so the remaining indexes stay valid.
          for (int j = group.size() - 1; j >= 1; j--) {
            p.removeRun(group.get(j));
          }
          i = group.get(0); // later runs have shifted down; resume after the kept run
          group.clear();
          groupText = search;
        }
      } else {
        group.clear();
        groupText = search;
      }
    }
  }
  // Table cells: this part only handles text that sits entirely in one run.
  for (XWPFTable tbl : doc.getTables()) {
    for (XWPFTableRow row : tbl.getRows()) {
      for (XWPFTableCell cell : row.getTableCells()) {
        for (XWPFParagraph p : cell.getParagraphs()) {
          for (XWPFRun r : p.getRuns()) {
            String text = r.getText(0);
            if (text != null && text.contains(search)) {
              // Position 0 overwrites the existing text instead of appending to it.
              r.setText(text.replace(search, replace), 0);
            }
          }
        }
      }
    }
  }
}

kiraluo163
  • 1
    This helped me. The bit with removeRun doesn't work for me because the index changes as you delete - replacing that line with p.removeRun(1) fixes this, and it works a treat. – Marcus May 17 '19 at 10:05
  • 1
    I was too hasty, and didn't test my change with enough data. Instead, simply replace the for loop wrapping removeRun with one that goes in reverse. i.e. `for(int j=group.size()-1; j>=1; j--)` and it works for me. – Marcus May 17 '19 at 10:17
  • @Marcus I have a similar issue; can you, or anyone here, help? Here is the link to the question: https://stackoverflow.com/q/65246636/13267143 – backToStack Dec 14 '20 at 14:41
  • I'm running into this same issue and taking this code as inspiration. I see two problems with this implementation: 1) the first Run could start with some other text and then contain a part of the search pattern. 2) the last Run could contain text that should not be removed or could contain the start into a new search pattern. Otherwise thank you for publishing this code- it gave me some ideas how to go about the solution. – Torsten Uhlmann May 17 '22 at 08:43
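The two problems raised in the last comment (leading text in the first run, trailing text in the last run) can be avoided by matching on the joined paragraph text and mapping offsets back to runs. Below is a minimal sketch of that idea, not the answer author's method. It models run texts as plain strings, and the class and method names are hypothetical. With POI, the input would be collected from XWPFRun.getText(0), treating null as an empty string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each String stands in for the text of one XWPFRun.
final class RunTextReplacer {

    // Replaces every occurrence of search in the joined run texts.
    // The result has one entry per input run. The replacement is written
    // into the run where the match starts, and the matched characters are
    // cut out of any later runs; all other characters stay in their run.
    static List<String> replaceAcrossRuns(List<String> runTexts, String search, String replace) {
        if (search.isEmpty()) {
            return new ArrayList<>(runTexts);
        }
        int n = runTexts.size();
        int[] start = new int[n + 1]; // start[i] = offset of run i in the joined text
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < n; i++) {
            start[i] = joined.length();
            joined.append(runTexts.get(i));
        }
        start[n] = joined.length();
        String all = joined.toString();

        List<StringBuilder> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new StringBuilder());
        }

        int pos = 0; // current offset in the joined text
        int run = 0; // index of the run that contains pos
        while (pos < all.length()) {
            while (start[run + 1] <= pos) {
                run++; // skip runs that end at or before pos (including empty runs)
            }
            if (all.startsWith(search, pos)) {
                out.get(run).append(replace);
                pos += search.length();
            } else {
                out.get(run).append(all.charAt(pos));
                pos++;
            }
        }

        List<String> result = new ArrayList<>();
        for (StringBuilder sb : out) {
            result.add(sb.toString());
        }
        return result;
    }
}
```

To apply the result to a paragraph, each run can be updated with runs.get(i).setText(result.get(i), 0). No runs are removed, so run indexes do not shift; runs that end up empty simply keep an empty text.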
1

For me it didn't work as expected every time. In my case I used "${PLACEHOLDER}" in the text. First we need to look at how Apache POI sees each paragraph that we want to iterate through run by run. If you dig into how a docx file is built, you will find that one run is a sequence of characters with the same font, size, colour, bold/italic setting, etc. Because of that, the placeholder was sometimes divided into parts, and sometimes the whole paragraph was recognized as one run, so it was impossible to iterate through individual words.
What I did was make the placeholder name bold in the template document. Then, when iterating through the runs, the whole placeholder name ${PLACEHOLDER} came back as its own run. I replaced that value with

for (XWPFRun r : p.getRuns()) {
  String text = r.getText(0);
  if (text != null && text.contains("originalText")) {
    text = text.replace("originalText", "newText");
    r.setText(text, 0);
  }
}

I've added just r.setBold(false); after setText.
That way the placeholder is recognized as a separate run, so I'm able to replace that specific placeholder, and in the processed document there is no bolding, just plain text.
For me an additional advantage was that I can visually find placeholders in the text faster. So finally the loop above looks like this:

for (XWPFRun r : p.getRuns()) {
  String text = r.getText(0);
  if (text != null && text.contains("originalText")) {
    text = text.replace("originalText", "newText");
    r.setText(text, 0);
    r.setBold(false); // remove the bold marker used in the template
  }
}

I hope it will help someone, since I spent too much time on this :)

EnGoPy
1

To be sure that a word will be treated as a single XWPFRun, you can use a MergeField as the variable in Word, like this:

  1. Place cursor on the word you want to be a single run.
  2. Press CTRL and F9 together and { } in gray will appear.
  3. Right-click on the { } field and select Edit Field.
  4. In pop-up box, select Mail Merge from Categories and then MergeField from Field Names.
  5. Click OK.
Yehouda
  • I used your method, but I don't understand why my MergeField is surrounded by «», what could be the reason? – The Prototype May 24 '23 at 09:08
  • @ThePrototype The «» are used to define the merged fields in Microsoft Word. you can Press ALT + F9 to toggle Field Codes on/off. – Yehouda May 25 '23 at 12:11
  • @Yehound, thanks, but then it looks something like this: `{ MERGEFIELD ${placeholder} }` – The Prototype May 25 '23 at 17:20
  • when you go to replace the merge fields with a value with Apache POI, the «» should disappear – Yehouda May 28 '23 at 15:24
  • @Yebouda Unfortunately this is not happening https://stackoverflow.com/questions/76322217/replace-text-placeholder-in-docx-with-mergefield-and-apache-poi?noredirect=1#comment134626598_76322217 – The Prototype May 29 '23 at 08:16
  • @ThePrototype my merge fields looks like that { MERGE FIELD placeholder} (without the $ and the other pair of arrow) maybe that is the reason. and to isolate the variable I use _ before and after like that : { MERGE FIELD __placeholder_ _} – Yehouda May 30 '23 at 10:05
0

I also had this issue a few days ago and couldn't find any solution. I chose to use PLACEHOLDER_NAME instead of [PLACEHOLDER_NAME]. This works fine for me, and it is seen as a single XWPFRun object.