I am using Apache POI to replace words in a docx file. For normal paragraphs, I can replace the words using XWPFParagraph and XWPFRun. I then tried to replace words inside a text box, using https://stackoverflow.com/a/25877256 as a reference for getting the text in the text box. I can print that text to the console, but I cannot replace words in the text box. Here is part of my code:

    for (XWPFParagraph paragraph : doc.getParagraphs()) {
        XmlObject[] textBoxObjects =  paragraph.getCTP().selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' .//*/wps:txbx/w:txbxContent");
            for (int i =0; i < textBoxObjects.length; i++) {
                XWPFParagraph embeddedPara = null;
                try {
                XmlObject[] paraObjects = textBoxObjects[i].
                    selectChildren(
                    new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p"));

                for (int j=0; j<paraObjects.length; j++) {
                    embeddedPara = new XWPFParagraph(CTP.Factory.parse(paraObjects[j].xmlText()), paragraph.getBody());
                    List<XWPFRun> runs = embeddedPara.getRuns();
                    for (XWPFRun r : runs) {
                        String text = r.getText(0);
                        if (text != null && text.contains(someWords)) {
                            text = text.replace(someWords, "replaced");
                            r.setText(text, 0);
                        }
                    }
                } 
                } catch (XmlException e) {
                //handle
                }
            }
    }

I think the problem is that I create a new XWPFParagraph, embeddedPara, so the replacement happens in embeddedPara rather than in the original paragraph. As a result, the words are still unchanged after I write the document to a file.

How can I read and replace the words in the text box without creating a new XWPFParagraph?

KC L
  • See https://stackoverflow.com/questions/35459386/change-font-size-in-text-box-apache-poi-word-docx/35462334#35462334. The problem is not creating the new `XWPFParagraph` but creating a `CTP` that is independent of the document. Your `XmlObject[] paraObjects` is an array of `XmlObject`s, each of which should be an `instanceof` `CTP`. So try: `embeddedPara = new XWPFParagraph((CTP)paraObjects[j], paragraph.getBody());`. Not tested, which is why this is a comment and not an answer. – Axel Richter Oct 18 '17 at 08:45
  • @AxelRichter I tried `embeddedPara = new XWPFParagraph((CTP)paraObjects[j], paragraph.getBody());`, which gives an error: `Cannot cast org.apache.xmlbeans.impl.values.XmlAnyTypeImpl to org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP`. I have read your answer before, but I still don't know how to modify my code. – KC L Oct 23 '17 at 02:21
  • You are right. The problem is bigger. See my answer. – Axel Richter Oct 23 '17 at 16:50

1 Answer

The problem occurs because Word text boxes may be contained in several different XmlObjects, depending on the Word version. Those XmlObjects may also be in very different namespaces. So selectChildren cannot follow the namespace route, and it returns an XmlAnyTypeImpl.

What all text box implementations have in common is that their runs are in the path .//*/w:txbxContent/w:p/w:r. So we can use an XmlCursor that selects that path and collect all selected XmlObjects in a List<XmlObject>. Then we parse CTRs from those objects; these are, of course, only CTRs outside the document context. But we can create XWPFRuns from them, do the replacing there, and then set the XML content of those XWPFRuns back into the objects. After this, the objects in the document contain the replaced content.
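
The copy-then-write-back idea described above can be illustrated without POI. The following is only an analogy using the JDK's DOM API, not the POI or XMLBeans API: `cloneNode` stands in for re-parsing the run, and `replaceChild` stands in for `obj.set(...)`. The class name, method name, and the un-namespaced element names `p`, `r`, and `t` are made up for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DetachedCopyDemo {

    // Parses a tiny stand-in document; the element names are hypothetical.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Edits a detached copy of the first <t> element and optionally writes it back.
    // Returns the text content of the whole document afterwards.
    static String replaceViaCopy(String xml, String from, String to, boolean writeBack)
            throws Exception {
        Document doc = parse(xml);
        Element live = (Element) doc.getElementsByTagName("t").item(0);

        // The clone has no parent, so edits to it do not show up in the document
        // (comparable to a CTR parsed from obj.xmlText()).
        Node copy = live.cloneNode(true);
        copy.setTextContent(copy.getTextContent().replace(from, to));

        if (writeBack) {
            // Put the edited copy where the original was (comparable to obj.set(...)).
            live.getParentNode().replaceChild(copy, live);
        }
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p><r><t>Hello TextBox</t></r></p>";
        System.out.println(replaceViaCopy(xml, "TextBox", "replaced", false)); // Hello TextBox
        System.out.println(replaceViaCopy(xml, "TextBox", "replaced", true));  // Hello replaced
    }
}
```

Without the write-back step the document keeps its old text, which is the behavior described in the question.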

Example:

import java.io.FileOutputStream;
import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.*;

import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;

import java.util.List;
import java.util.ArrayList;

public class WordReplaceTextInTextBox {

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("WordReplaceTextInTextBox.docx"));

  String someWords = "TextBox";

  for (XWPFParagraph paragraph : document.getParagraphs()) {
   XmlCursor cursor = paragraph.getCTP().newCursor();
   cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:txbxContent/w:p/w:r");

   List<XmlObject> ctrsintxtbx = new ArrayList<XmlObject>();

   while(cursor.hasNextSelection()) {
    cursor.toNextSelection();
    XmlObject obj = cursor.getObject();
    ctrsintxtbx.add(obj);
   }
   for (XmlObject obj : ctrsintxtbx) {
    CTR ctr = CTR.Factory.parse(obj.xmlText());
    //CTR ctr = CTR.Factory.parse(obj.newInputStream());
    XWPFRun bufferrun = new XWPFRun(ctr, (IRunBody)paragraph);
    String text = bufferrun.getText(0);
    if (text != null && text.contains(someWords)) {
     text = text.replace(someWords, "replaced");
     bufferrun.setText(text, 0);
    }
    obj.set(bufferrun.getCTR());
   }
  }

  FileOutputStream out = new FileOutputStream("WordReplaceTextInTextBoxNew.docx");
  document.write(out);
  out.close();
  document.close();
 }
}
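
The path expression in the example above matches the runs of every `w:txbxContent`, which can include a backwards-compatibility copy of the same text box under `mc:Fallback` (this is discussed in the comments). The sketch below tries a few path variants with the JDK's own XPath engine rather than XMLBeans, so the namespaces are declared through a `NamespaceContext` instead of a `declare namespace` prefix. The sample XML is a heavily simplified, hand-written stand-in for a real paragraph, and the class and method names are assumptions for illustration. XMLBeans may evaluate these paths differently, so treat the counts as valid for this sample only.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextBoxXPathDemo {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";
    static final String WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape";
    static final String V = "urn:schemas-microsoft-com:vml";

    // Simplified stand-in: one text box stored twice, once under mc:Choice
    // and once under mc:Fallback. Real documents nest more elements.
    static final String SAMPLE =
        "<w:p xmlns:w='" + W + "' xmlns:mc='" + MC + "' xmlns:wps='" + WPS + "' xmlns:v='" + V + "'>"
      + "<w:r><mc:AlternateContent>"
      + "<mc:Choice><w:drawing><wps:wsp><wps:txbx><w:txbxContent>"
      + "<w:p><w:r><w:t>TextBox</w:t></w:r></w:p>"
      + "</w:txbxContent></wps:txbx></wps:wsp></w:drawing></mc:Choice>"
      + "<mc:Fallback><w:pict><v:textbox><w:txbxContent>"
      + "<w:p><w:r><w:t>TextBox</w:t></w:r></w:p>"
      + "</w:txbxContent></v:textbox></w:pict></mc:Fallback>"
      + "</mc:AlternateContent></w:r></w:p>";

    // Maps the prefixes used in the path expressions to namespace URIs.
    static final NamespaceContext NS = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            if ("w".equals(prefix)) return W;
            if ("mc".equals(prefix)) return MC;
            return XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) { return null; }
        public Iterator<String> getPrefixes(String uri) { return Collections.emptyIterator(); }
    };

    // Counts the nodes the expression selects, relative to the outer w:p element.
    static int count(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(NS);
        NodeList hits = (NodeList) xpath.evaluate(
                expression, doc.getDocumentElement(), XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // No namespace prefixes: nothing matches, because the elements are namespaced.
        System.out.println(count(".//txbxContent/p/r"));                                   // 0
        // The broad path: matches the run in both the Choice and the Fallback copy.
        System.out.println(count(".//*/w:txbxContent/w:p/w:r"));                           // 2
        // Only text boxes below w:drawing.
        System.out.println(count(".//w:drawing//w:txbxContent/w:p/w:r"));                  // 1
        // Only text boxes that are not below mc:Fallback.
        System.out.println(count(".//w:txbxContent[not(ancestor::mc:Fallback)]/w:p/w:r")); // 1
    }
}
```

Whether the fallback copy should be skipped depends on the goal: for replacing text, updating both copies keeps them consistent, while for extracting text, one copy is usually enough.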

Axel Richter
  • It works, and your explanation is clear. Thank you very much! – KC L Oct 24 '17 at 05:58
  • It extracts the same text twice for me. Do you know why? – Nathan B Apr 18 '18 at 15:14
  • @Nadav B: Yes, because there is a `...... ...` for each text box as a fallback for backwards compatibility. But this question was about replacing text in text boxes, so also replacing the text in the fallback elements is the wanted behavior. – Axel Richter Apr 18 '18 at 15:29
  • How can I make sure each text extraction from a text box happens once, and the replacement happens once? – Nathan B Apr 18 '18 at 15:37
  • Well, do not select `w:txbxContent` elements that are descendants of `mc:Fallback`. Put differently, select only `w:txbxContent` elements that are not descendants of `mc:Fallback`. Or select only `w:txbxContent` elements that are descendants of `w:drawing`: `cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//w:drawing/*/w:txbxContent/w:p/w:r");` – Axel Richter Apr 18 '18 at 15:58
  • This is a pretty old answer. But may I ask, how were you able to know which query to give in the `.selectPath()` function? Great solution! – Renis1235 Dec 16 '21 at 09:37
  • @Renis1235: To get knowledge about the XML of Office Open XML documents, one can simply unzip the `*.docx` (or `*.xlsx` or `*.pptx` or ...). These are simply ZIP archives. – Axel Richter Dec 16 '21 at 10:29
  • Thank you! Why did you also put the `declare namespace ...` inside the brackets, though? Is there a convention for how to query XML files, or is this specific to this library? – Renis1235 Dec 16 '21 at 10:47
  • @Renis1235: If namespaces are used in the XML, they need to be declared, or else XPath will not work. If the question is how to do that, the documentation will mostly be helpful. In this case: https://xmlbeans.apache.org/docs/4.0.0/guide/conSelectingXMLwithXQueryPathXPath.html. – Axel Richter Dec 16 '21 at 11:04
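
Since a `*.docx` file is a ZIP archive, as noted in the comments, its XML can also be inspected programmatically instead of unzipping it by hand. The following is a minimal sketch using only `java.util.zip`; the class and method names are made up, and it assumes the main document part is stored as `word/document.xml`, which is the usual location but not guaranteed for every file.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class DocxXmlDump {

    // Returns the named ZIP entry decoded as UTF-8 text, or null if it is not present.
    static String readEntry(InputStream zipData, String entryName) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    // readAllBytes() stops at the end of the current entry.
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java DocxXmlDump.java <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            // Assumed location of the main document part.
            System.out.println(readEntry(in, "word/document.xml"));
        }
    }
}
```

Searching the printed XML for `txbxContent` shows which elements wrap the text boxes in a given file, which is how a suitable path expression can be worked out.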