1

I am using Apache POI to manipulate Word documents. Is there any way to get the headings from a .doc file? I am able to get the plain text, but I need to tell the headings apart from the rest of the document. Is there any function in the Apache POI API that returns only the headings of an MS Word file?

Colin 't Hart
Stunner
  • Are you looking for text which has a style like `Heading 1` applied to it, or some algorithm which spots that one line of text is bold + 10 points bigger than the surrounding, so probably is some sort of heading? – Gagravarr Oct 30 '13 at 11:59
  • Both. Basically I need to retrieve all the headings (sub-headings too) from the doc file. Instead of writing our own algorithm based on font size and boldness, I'm looking for a method in the API which gets all headings from the doc file. – Stunner Nov 01 '13 at 04:31
  • POI can tell you what things have a Heading style applied to them, but that won't help in the common case when people don't mark it as a heading, and instead just bump up the font size... – Gagravarr Nov 01 '13 at 06:50
  • OK..I think the problem itself is not generalized... – Stunner Nov 01 '13 at 08:06

3 Answers

2

Promoting a comment to an answer

There are two ways to make a "Heading" in Word. The "proper" way, and the way that most people seem to do it...

  1. In the styles dropdown, pick the appropriate heading style, write your text, then go back to the normal paragraph style for the next line.

  2. Highlight a line, and bump up the font size + make it bold or italic.

If your users are doing #2, you have no reliable way of identifying the headings. Short of writing some fuzzy matching logic to try to spot when the font size jumps, you're out of luck.
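For what it's worth, the "fuzzy matching" idea for case #2 could be sketched roughly as below. This is only a heuristic under stated assumptions, not something POI provides: the `Para` holder, the 80-character limit and the 2-point threshold are all hypothetical choices, and the font size and bold flag would have to be read from the document's character runs beforehand.

```java
import java.util.ArrayList;
import java.util.List;

final class HeadingGuesser {
    /** Hypothetical holder for one paragraph's text and dominant formatting. */
    static final class Para {
        final String text;
        final double fontSize; // in points
        final boolean bold;

        Para(String text, double fontSize, boolean bold) {
            this.text = text;
            this.fontSize = fontSize;
            this.bold = bold;
        }
    }

    /** Median font size, used as an estimate of the body text size. */
    static double medianSize(List<Para> paras) {
        double[] sizes = paras.stream().mapToDouble(p -> p.fontSize).sorted().toArray();
        int n = sizes.length;
        if (n == 0) {
            return 0;
        }
        return n % 2 == 1 ? sizes[n / 2] : (sizes[n / 2 - 1] + sizes[n / 2]) / 2.0;
    }

    /** Returns the text of short paragraphs that stand out from the body text. */
    static List<String> guessHeadings(List<Para> paras) {
        double body = medianSize(paras);
        List<String> headings = new ArrayList<>();
        for (Para p : paras) {
            boolean shortLine = !p.text.trim().isEmpty() && p.text.length() <= 80;
            boolean clearlyBigger = p.fontSize >= body + 2.0;     // assumed threshold
            boolean boldAndBigger = p.bold && p.fontSize > body;  // weaker signal
            if (shortLine && (clearlyBigger || boldAndBigger)) {
                headings.add(p.text);
            }
        }
        return headings;
    }
}
```

Expect false positives (bold captions, pull quotes) and false negatives (headings set in the body size), so treat the result as a list of candidates only.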

For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with Heading (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.

If you look at the Apache Tika MS-Word parser, which is built on top of POI, you'll see a good example there of iterating over the paragraphs and checking the styles.
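The name check described for case #1 can be kept separate from the POI calls. Here is a minimal sketch of it; the `HeadingStyles` class and its methods are hypothetical helpers, not part of POI, and they assume you have already obtained the style name as a string (for example from `StyleDescription.getName()` in HWPF or `XWPFStyle.getName()` in XWPF).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeadingStyles {
    // Matches names such as "Heading 1" or "heading 2" and captures the level.
    private static final Pattern LEVEL =
            Pattern.compile("^heading\\s*(\\d{1,2})\\s*$", Pattern.CASE_INSENSITIVE);

    /** True if the style name starts with "heading", ignoring case. */
    static boolean isHeading(String styleName) {
        return styleName != null
                && styleName.trim().toLowerCase(Locale.ROOT).startsWith("heading");
    }

    /** Level parsed from names like "Heading 2", or 0 if no level is present. */
    static int level(String styleName) {
        if (styleName == null) {
            return 0;
        }
        Matcher m = LEVEL.matcher(styleName.trim());
        return m.matches() ? Integer.parseInt(m.group(1)) : 0;
    }
}
```

With this in place, the paragraph loop only has to look up the style name, call `isHeading`, and collect the paragraph text when it returns true.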

Gagravarr
2

As Gagravarr says:

For #1, it's fairly easy in Apache POI. What you'll want to do is grab the style description of the style that applies to a paragraph, then get the name of the style. If that starts with Heading (case insensitive), you know you've found a heading. Get the text of that paragraph, and move on through the document.

Using Apache POI (XWPF, so for .docx files), the code looks like this:

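        import java.io.File;
        import java.io.FileInputStream;
        import java.util.List;
        import org.apache.poi.openxml4j.opc.OPCPackage;
        import org.apache.poi.xwpf.usermodel.XWPFDocument;
        import org.apache.poi.xwpf.usermodel.XWPFParagraph;
        import org.apache.poi.xwpf.usermodel.XWPFStyle;
        import org.apache.poi.xwpf.usermodel.XWPFStyles;

        // The statements below belong inside a method that declares "throws Exception".
        File f = new File("test.docx");
        FileInputStream fis = new FileInputStream(f);
        XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
        XWPFStyles styles = xdoc.getStyles();
        List<XWPFParagraph> paragraphs = xdoc.getParagraphs();
        for (int i = 0; i < paragraphs.size(); i++) {
            XWPFParagraph paragraph = paragraphs.get(i);
            String styleId = paragraph.getStyleID();
            System.out.println("paragraph style id " + (i + 1) + ":" + styleId);
            if (styleId == null) {
                continue; // no explicit style on this paragraph
            }
            XWPFStyle style = styles.getStyle(styleId);
            if (style == null || style.getName() == null) {
                continue; // style id not found in the style table
            }
            System.out.println("Style name:" + style.getName());
            // Compare case-insensitively: the stored name may be "heading 1" or "Heading 1".
            if (style.getName().toLowerCase().startsWith("heading")) {
                // this is a heading
                System.out.println("Heading: " + paragraph.getText());
            }
        }
        fis.close();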
Opaka
0

At least for HWPF (i.e. the old binary .doc format), and assuming a properly formatted file (case #1 in the other answers), you should not rely exclusively on the style name. It may be a language-dependent value ("Heading" in English, "Titre" in French, etc.).

Paragraph.getLvl(), which encodes the level where the respective paragraph is shown in Word's outline view, often makes a good secondary source. 1 constitutes the most significant level, all subsequent numbers up to 8 stand for less significant heading candidates and 9 is the value that Word assigns to ordinary (non-heading) paragraphs by default.
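Combining the two signals could look something like the sketch below. The `OutlineLevels` class is a hypothetical helper, not part of POI; it assumes you pass in the value returned by `Paragraph.getLvl()` together with the style name, and it treats any outline level below 9 as a heading candidate. It is worth checking the raw values against a few of your own files before relying on them.

```java
import java.util.Locale;

final class OutlineLevels {
    /** Outline level that Word assigns to ordinary body text, per the description above. */
    static final int BODY_TEXT_LVL = 9;

    /**
     * Treats a paragraph as a heading candidate if either its outline level
     * is below the body-text level or its style name starts with "heading".
     */
    static boolean isHeadingCandidate(String styleName, int lvl) {
        boolean byOutline = lvl >= 0 && lvl < BODY_TEXT_LVL;
        boolean byName = styleName != null
                && styleName.trim().toLowerCase(Locale.ROOT).startsWith("heading");
        return byOutline || byName;
    }
}
```

The outline level catches localized style names such as "Titre 1", while the name check still catches heading-named styles whose outline level was left at the body-text default.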

morido