Apache POI - Retrieve text content between keywords in .doc file and conditionally render it

Question

I would like to find text content between two keywords in .doc files, and conditionally render that text content or hide it. For example:

Lorem Ipsum is simply dummy text ${if condition} of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s ${endif}

When I parse the document with Apache POI, I would like to locate every piece of content enclosed between the markers ${if condition} and ${endif}, and then either render or omit that content in the document I produce.

After parsing, the text above should take one of the following two forms:

1) In case the condition is satisfied

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s

or

2) In case the condition is not satisfied

Lorem Ipsum is simply dummy text

I have tried to do this with XWPFParagraph and XWPFRun, but that is not reliable, because a run can be split in the middle of a word under unpredictable conditions.
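
The failure mode can be reproduced without POI. The sketch below is only an illustration: each list element stands in for the text of one run, the three-way split is a made-up example, and `RunBoundaryDemo` and its methods are hypothetical names. It shows that searching run by run misses a marker that is spread over several runs, while searching the concatenated text finds it.

```java
import java.util.List;
import java.util.regex.Pattern;

public class RunBoundaryDemo {

    // Same shape as the marker in the question: ${if <word>}
    static final Pattern IF_TOKEN = Pattern.compile("\\$\\{if\\s+(\\w+)\\}");

    // True if any single run contains a complete ${if ...} marker.
    static boolean foundInSingleRun(List<String> runs) {
        return runs.stream().anyMatch(r -> IF_TOKEN.matcher(r).find());
    }

    // True if the marker appears once the run texts are concatenated.
    static boolean foundInJoinedText(List<String> runs) {
        return IF_TOKEN.matcher(String.join("", runs)).find();
    }

    public static void main(String[] args) {
        // Hypothetical split of one marker across three runs.
        List<String> runs = List.of("dummy text ${i", "f cond", "ition} of the printing");
        System.out.println("per run: " + foundInSingleRun(runs));
        System.out.println("joined:  " + foundInJoinedText(runs));
    }
}
```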

Could you please propose any reliable way to achieve my use case? Thanks in advance.

Have you tried https://poi.apache.org/apidocs/dev/org/apache/poi/xwpf/usermodel/XWPFParagraph.html#getText-- ? — PJ Fanning, Sep 10 '21 at 23:39
Hi @PJFanning and thanks for your reply. Yes, I am aware of that function, but I don't think it's enough for my use case. I think I need something like a `setText` or `replaceText` function — NickAth, Sep 12 '21 at 21:32
The built-in way to overcome the `Word` text-run issues is using `TextSegment`. Example: https://stackoverflow.com/questions/65275097/apache-poi-my-placeholder-is-treated-as-three-different-runs/65289246#65289246. But `XWPFParagraph.searchText` still has several issues. So I doubt there is a "reliable built in solution" yet for replacing some text content with another in `Word` documents. — Axel Richter, Sep 15 '21 at 10:33
But why try to replace text contents at all? This is not what one should do using word processing software. There are other possibilities to handle conditional content: mail merge with mail merge fields, form fields, content control fields... And to mark conditional text parts, one should use bookmarks instead of relying on special text contents. — Axel Richter, Sep 15 '21 at 10:34
Hi @AxelRichter and thanks for your reply, I found a workaround solution for the conditional rendering of content between my custom `condition blockquotes`, my main "glitch" remains the problem that I don't know for sure how my blockquotes will be separated into multiple `runs` and I do not know beforehand their form in order to use the `searchText` function, since the conditions will contain expressions inside them, in the following form: `${if answerId=2} blablabla.... ${endif}` or `${if questionId=5} blablabla.... ${endif}` — NickAth, Sep 15 '21 at 17:59
Pardon me if I did not make my point clear. I can provide you with more clear information and what I have done so far, thanks in advance — NickAth, Sep 15 '21 at 18:00
So the question is no longer open? Put your solution as an answer and accept it... — Queeg, Sep 17 '21 at 11:55
No, the question is open, I have not so far found anything that fits for my case... — NickAth, Sep 17 '21 at 11:57

score 2 · Accepted Answer · answered Sep 20 '21 at 13:31

Take this as an example (code is tested):

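import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTRPr;

class ParagraphModifier {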

    private final Pattern pIf = Pattern.compile("\\$\\{if\\s+(\\w+)\\}");
    private final Pattern pEIf = Pattern.compile("\\$\\{endif\\}");
    private final Function<String, Boolean> processor;

    public ParagraphModifier(Function<String, Boolean> processor) {
        this.processor = processor;
    }

    // Process

    static class Pair<K, V> {
        public K key;
        public V value;
        public Pair(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

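    // https://stackoverflow.com/questions/23112924
    // Copies the formatting (run properties) and the first text element of source into clone.
    public static void cloneRun(XWPFRun clone, XWPFRun source) {
        // A run without explicit formatting has no rPr element, so copy it only when present.
        if (source.getCTR().isSetRPr()) {
            CTRPr rPr = clone.getCTR().isSetRPr() ? clone.getCTR().getRPr() : clone.getCTR().addNewRPr();
            rPr.set(source.getCTR().getRPr());
        }
        clone.setText(source.getText(0));
    }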

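    // Text of the run's first text element, or "" when the run has none (getText returns null).
    static String textOf(XWPFRun run) {
        String text = run.getText(0);
        return text == null ? "" : text;
    }

    // Splits the run covering the given text offset and returns the index of the run
    // that starts at that offset (the run count if the offset is the end of the text).
    int splitAtTextPosition(XWPFParagraph paragraph, int position) {
        List<XWPFRun> runs = paragraph.getRuns();
        int offset = 0;

        for (int i = 0; i < runs.size(); i++) {
            XWPFRun run = runs.get(i);
            String text = textOf(run);
            int length = text.length();

            if (position >= (offset + length)) {
                offset += length;
                continue;
            }

            // Do split
            XWPFRun run2 = paragraph.insertNewRun(i + 1);
            cloneRun(run2, run);
            run.setText(text.substring(0, position - offset), 0);
            run2.setText(text.substring(position - offset), 0);
            return i + 1;
        }
        // Offset is exactly the end of the paragraph text: nothing to split.
        return position == offset ? runs.size() : -1;
    }

    // Concatenates the run texts; offsets into this string match splitAtTextPosition.
    String getParagraphText(XWPFParagraph paragraph) {
        StringBuilder sb = new StringBuilder();
        for (XWPFRun run : paragraph.getRuns()) sb.append(textOf(run));
        return sb.toString();
    }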

    void removeRunsRange(XWPFParagraph paragraph, int from, int to) {
        int runs = paragraph.getRuns().size();
        to = Math.min(to, runs);
        for (int i = (to - 1); i >= from; i--) {
            paragraph.removeRun(i);
        }
    }

    Pair<Integer, String> extractToken(Pattern pattern, XWPFParagraph paragraph) {
        String text = getParagraphText(paragraph);
        Matcher matcher = pattern.matcher(text);

        if (matcher.find()) {
            int rStart = splitAtTextPosition(paragraph, matcher.start());
            int rEnd = splitAtTextPosition(paragraph, matcher.end());
            removeRunsRange(paragraph, rStart, rEnd);
            return new Pair<>(rStart, matcher.group());
        } else {
            return new Pair<>(-1, "");
        }
    }

    void applyParagraph(XWPFParagraph paragraph) {
        int lastIf = -1;

        while (true) {
            var tIf = extractToken(pIf, paragraph);
            if (tIf.key == -1) {
                break;
            }
            if (tIf.key < lastIf) {
                throw new IllegalStateException("If conditions can not be nested");
            }

            var tEIf = extractToken(pEIf, paragraph);
            if (tEIf.key == -1) {
                throw new IllegalStateException("If condition missing endif");
            }

            var m = pIf.matcher(tIf.value);
            var keep = m.find() && processor.apply(m.group(1));
            if (!keep) {
                removeRunsRange(paragraph, tIf.key, tEIf.key);
            }

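            // Removing the block shifts later runs down, so the endif index is only valid when the block is kept.
            lastIf = keep ? tEIf.key : tIf.key;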
        }
    }

    void apply(Iterable<XWPFParagraph> paragraphs) {
        for (XWPFParagraph p : paragraphs) {
            applyParagraph(p);
        }
    }

}
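
The core idea of `splitAtTextPosition` and `extractToken` can also be tried without POI. The sketch below is only a model under that assumption: each list element stands in for one run's text, and `SplitModel`, `splitAt`, and `removeToken` are hypothetical names, not part of POI or of the class above. It splits the list at the start and end offsets of the marker so that the marker occupies whole elements, then removes those elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitModel {

    // Splits the element covering `position` in two and returns the index of the
    // element that now starts at `position` (runs.size() if position is the end of the text).
    static int splitAt(List<String> runs, int position) {
        int offset = 0;
        for (int i = 0; i < runs.size(); i++) {
            String text = runs.get(i);
            if (position >= offset + text.length()) {
                offset += text.length();
                continue;
            }
            runs.set(i, text.substring(0, position - offset));
            runs.add(i + 1, text.substring(position - offset));
            return i + 1;
        }
        return position == offset ? runs.size() : -1;
    }

    // Removes the first match of `token` and returns the index where it stood, or -1 if absent.
    static int removeToken(List<String> runs, Pattern token) {
        Matcher m = token.matcher(String.join("", runs));
        if (!m.find()) {
            return -1;
        }
        int start = splitAt(runs, m.start());
        int end = splitAt(runs, m.end());
        runs.subList(start, end).clear();
        return start;
    }

    public static void main(String[] args) {
        // Hypothetical run layout with the marker spread over two runs.
        List<String> runs = new ArrayList<>(List.of("Hello ${i", "f cond} wor", "ld"));
        int at = removeToken(runs, Pattern.compile("\\$\\{if\\s+(\\w+)\\}"));
        System.out.println(at + " -> " + runs);
    }
}
```

Because the split happens on element boundaries only, the text on either side of the marker stays in its original element, which is what lets the POI version keep the formatting of the surrounding runs.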

Usage:

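import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;

class Main {

    // Loads a .docx from the classpath; the stream is closed once the document has been read.
    private static XWPFDocument loadDoc(String name) throws IOException {
        try (InputStream is = Main.class.getClassLoader().getResourceAsStream(name)) {
            if (is == null) {
                throw new FileNotFoundException("Resource not found: " + name);
            }
            return new XWPFDocument(is);
        }
    }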

    private static void saveDoc(String path, XWPFDocument doc) throws IOException {
        try (var fos = new FileOutputStream(path)) {
            doc.write(fos);
        }
    }

    public static void main (String[] args) throws Exception {
        var xdoc = loadDoc("test.docx");

        var pm = new ParagraphModifier("true"::equalsIgnoreCase);
        pm.apply(xdoc.getParagraphs());

        saveDoc("test.out.docx", xdoc);
    }
    
}

This solution does not support ${if} blocks that span multiple paragraphs, nested ifs, or table structures. Expanding the solution to support them should be straightforward.
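
Note also that the pattern above only accepts a single word after `if`, so expressions such as `${if answerId=2}` from the comments would not match. A minimal sketch of one way to handle them, assuming conditions are always of the form `name=value` and that the values come from a map: `ConditionExpression`, `IF_EXPR`, and `equalsProcessor` are hypothetical names. To use it, `pIf` in `ParagraphModifier` would have to be replaced by a pattern like `IF_EXPR`, so that `group(1)` hands the whole expression to the processor.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConditionExpression {

    // Matches ${if name=value}; group(1) is the whole expression, e.g. "answerId=2".
    static final Pattern IF_EXPR = Pattern.compile("\\$\\{if\\s+(\\w+\\s*=\\s*\\w+)\\}");

    // Builds a processor that is true when the named variable equals the given value.
    static Function<String, Boolean> equalsProcessor(Map<String, String> vars) {
        return expr -> {
            String[] parts = expr.split("=", 2);
            return parts.length == 2
                    && parts[1].trim().equals(vars.get(parts[0].trim()));
        };
    }

    public static void main(String[] args) {
        // Hypothetical variable values for one generated document.
        Function<String, Boolean> processor = equalsProcessor(Map.of("answerId", "2"));

        Matcher m = IF_EXPR.matcher("${if answerId=2} blablabla.... ${endif}");
        if (m.find()) {
            System.out.println(m.group(1) + " keep=" + processor.apply(m.group(1)));
        }
    }
}
```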

Brilliant solution! The `splitAtTextPosition` was exactly what I needed! Thank you so much for following up :). Did you have any experience with the Apache POI library before? — NickAth, Sep 20 '21 at 17:42
Not really, but I ended up doing similar stuff with PDFs in the past, and I know how much it hurts ;) — Newbie, Sep 20 '21 at 17:49
