Suppose I have two .docx files, input.docx and output.docx. I need to select some of the content in input.docx and copy it to output.docx. When I print the content of newdoc to the console it looks correct, but output.docx contains nothing except blank lines. Can anyone offer advice?

InputStream is = new FileInputStream("D:\\input.docx");
XWPFDocument doc = new XWPFDocument(is);

List<XWPFParagraph> paras = doc.getParagraphs();
List<XWPFRun> runs;
XWPFDocument newdoc = new XWPFDocument();
for (XWPFParagraph para : paras) {
    runs = para.getRuns();
    if (!para.isEmpty()) {
        XWPFParagraph newpara = newdoc.createParagraph();
        XWPFRun newrun = newpara.createRun();
        for (int i = 0; i < runs.size(); i++) {
            newrun = runs.get(i);
            newpara.addRun(newrun);
        }
    }
}

List<XWPFParagraph> newparas = newdoc.getParagraphs();
for (XWPFParagraph para1 : newparas) {
    System.out.println(para1.getParagraphText());
} // in the console, I have the correct information

FileOutputStream fos = new FileOutputStream(new File("D:\\output.docx"));
newdoc.write(fos);
fos.flush();
fos.close();
flyingmouse

1 Answer

I slightly modified your code; this version copies the text without changing its formatting.

public static void main(String[] args) {
    // try-with-resources closes the input stream even if an exception is thrown
    try (InputStream is = new FileInputStream("Japan.docx")) {
        XWPFDocument doc = new XWPFDocument(is);

        List<XWPFParagraph> paras = doc.getParagraphs();

        XWPFDocument newdoc = new XWPFDocument();
        for (XWPFParagraph para : paras) {
            // Skip paragraphs that contain no text
            if (!para.getParagraphText().isEmpty()) {
                XWPFParagraph newpara = newdoc.createParagraph();
                copyAllRunsToAnotherParagraph(para, newpara);
            }
        }

        // The output stream is flushed and closed automatically
        try (FileOutputStream fos = new FileOutputStream(new File("newJapan.docx"))) {
            newdoc.write(fos);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Copy all runs from one paragraph to another, keeping the style unchanged
private static void copyAllRunsToAnotherParagraph(XWPFParagraph oldPar, XWPFParagraph newPar) {
    final int DEFAULT_FONT_SIZE = 10;

    for (XWPFRun run : oldPar.getRuns()) {  
        String textInRun = run.getText(0);
        if (textInRun == null || textInRun.isEmpty()) {
            continue;
        }

        int fontSize = run.getFontSize();
        System.out.println("run text = '" + textInRun + "' , fontSize = " + fontSize); 

        XWPFRun newRun = newPar.createRun();

        // Copy text
        newRun.setText(textInRun);

        // Apply the same style
        // getFontSize() returned -1 when POI could not determine the size; fall back to the default
        newRun.setFontSize( (fontSize == -1) ? DEFAULT_FONT_SIZE : fontSize );
        newRun.setFontFamily( run.getFontFamily() );
        newRun.setBold( run.isBold() );
        newRun.setItalic( run.isItalic() );
        newRun.setStrike( run.isStrike() );
        newRun.setColor( run.getColor() );
    }   
}

There is still a small problem with fontSize. Sometimes POI cannot determine the size of a run (I print the value to the console to trace it) and returns -1. It reports the size correctly when I set the font myself (say, by selecting some paragraphs in Word and setting the size or font family manually). But when it processes other POI-generated text, it sometimes returns -1. So I introduced a default font size (10 in the example above) that is applied whenever POI returns -1.

Another issue seems to arise with the Calibri font family. In my tests, however, POI sets it to Arial by default, so I did not add a default-fontFamily workaround like the one for fontSize.

Other font properties (Bold, italic, etc.) work well.

All these font problems are probably due to the fact that, in my tests, the text was copied from a .doc file. If your input is a .doc file, open it in Word, choose "Save As...", and pick the .docx format. Then use only XWPFDocument instead of HWPFDocument in your program, and I suppose it will be okay.
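
To tell whether an input file really is a legacy .doc or a .docx before choosing between HWPFDocument and XWPFDocument, one option is to look at the first bytes of the file. The following is only a sketch using the standard library: `WordFormatSniffer` and its `Format` enum are hypothetical names, not part of POI. It assumes that legacy .doc files start with the OLE2 compound-file signature and that .docx files start with the ZIP local-file-header signature. Note that a match only identifies the container, so another OLE2 or ZIP file (for example a spreadsheet) would be reported the same way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of POI): guesses the container format from the first bytes.
final class WordFormatSniffer {
    enum Format { OLE2_DOC, ZIP_DOCX, UNKNOWN }

    // OLE2 compound-file signature, used by legacy .doc files
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature ("PK" 03 04), used by .docx packages
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    static Format sniff(Path file) throws IOException {
        byte[] head = new byte[OLE2.length];
        int read;
        try (InputStream in = Files.newInputStream(file)) {
            // Reads up to 8 bytes; returns fewer if the file is shorter
            read = in.readNBytes(head, 0, head.length);
        }
        if (startsWith(head, read, OLE2)) {
            return Format.OLE2_DOC;
        }
        if (startsWith(head, read, ZIP)) {
            return Format.ZIP_DOCX;
        }
        return Format.UNKNOWN;
    }

    // True if the first `length` valid bytes of `data` begin with `prefix`
    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could run `WordFormatSniffer.sniff(...)` on the input path first and only construct an XWPFDocument when the result is `ZIP_DOCX`; a file that reports `OLE2_DOC` would need the "Save As" conversion described above.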

DenisFLASH
  • Thank you for your answer. POI is not as good as I expected. I wonder if there are other ways to do the same thing. I ran your code on a .docx file (1.03 MB, 443 pages, 221,803 words), and there are 3 problems: 1. As you said, the “-1” problem. Almost all font sizes are reported as “-1”, so the output is not perfect; it also lost the list numbering (such as 1., 2., 3.). I followed your suggestion to “save as” the .doc file to a .docx file, but the problem still exists. – flyingmouse Aug 06 '14 at 03:53
  • 2. It takes more than 40 minutes to process the document (my computer is an Intel i7-3770 with 4 GB RAM). I have hundreds of similar documents, so it would be too time-consuming. 3. The `output.docx` cannot be opened directly. Word says it cannot open the Office Open XML document output.docx because the file has errors. Fortunately, I can open it by restoring the file to a new file. Anyway, thank you very much for your answer. I will keep searching for better solutions. – flyingmouse Aug 06 '14 at 03:53
  • @flyingmouse Thank you for the green mark. But I'm not completely satisfied, as you still have big problems. So let's try to see what we can do. Ouch, 40 minutes? That's huge! I've never worked with big files, and that's why I never really cared about time optimization. But now the time has come ;-) The code above is not perfect at all, and I'll try to optimize it. And here I need your help: I would like to reproduce the same problems you have, so please tell me the sequence of actions you perform from the very beginning: you take a .doc from the website, and what next? – DenisFLASH Aug 06 '14 at 06:36
  • @flyingmouse Unfortunately, this code is not supposed to keep the numbering "1. 2. 3.". For the moment, I don't have code that handles it. At work I'm currently developing a utility that will process .docx files and filter the chapters: the user selects which chapters/subchapters to include, and the program rebuilds the .docx to include only the selected chapters, refreshing the numbering. But I'm still at the stage of writing requirements and designing the architecture. On Friday I leave for a 2-week vacation, and it's only at the end of August that I will write the code. Is your task urgent? – DenisFLASH Aug 06 '14 at 06:45
  • @flyingmouse Concerning other technologies, I've only heard of Jasper Reports, but I can't say whether it's really convenient to apply here. POI is not ideal, that's true. But maybe the problem is due to Microsoft themselves, as they made a huge change between the .doc and .docx formats. Try one trick: rename any `example.docx` to `example.zip`, then **decompress** it. You'll see a file structure with lots of files such as document.xml, styles.xml, etc. That's how Microsoft stores data in .docx, and it's completely different from .doc. That's why POI's HWPF and XWPF are almost incompatible – DenisFLASH Aug 06 '14 at 06:53
  • thank you so much. It is difficult to write comments here, and I don't know your email, so I create a gmail account and leave a message for you in the Inbox. The gmail account: stackoverflowflyingmouse@gmail.com, the password is `stackoverflow`. Thank you :) (my email address is flyingmouse820 at gmail dot com) – flyingmouse Aug 06 '14 at 07:44
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/58754/discussion-between-flyingmouse-and-denisflash). – flyingmouse Aug 06 '14 at 07:51
  • @flyingmouse wow) account is great, but can you just send the same message to the mail shown in my StackOverflow profile (see personal info section, near avatar). I think it's easier – DenisFLASH Aug 06 '14 at 07:51
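
The rename-to-.zip trick mentioned in the comments can also be done from code, without renaming anything, because a .docx file is a ZIP package. The following is a minimal sketch using only java.util.zip; `DocxPartLister` and `listParts` are hypothetical names chosen for illustration, not POI classes. It assumes the path points to a readable ZIP-based file and simply returns the names of the entries stored inside it.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of POI): lists the parts stored inside a .docx package.
final class DocxPartLister {
    static List<String> listParts(Path docx) throws IOException {
        List<String> names = new ArrayList<>();
        // A .docx is a ZIP archive, so ZipFile can open it directly
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

For a document saved by Word, the returned list would typically include entries such as `word/document.xml` and `word/styles.xml`, which are the files the comment refers to.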