Get embedded resources in doc files using Apache Tika

Question

I have MS Word documents containing text and images, and I want to parse them into an XML structure. After some research I ended up using Apache Tika to convert my documents. I can parse my doc to XML; here is my code:

AutoDetectParser parser=new AutoDetectParser();
InputStream input=new FileInputStream(new File("1.docx"));
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));

parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();

I want to extract the images from the document and convert them to binary format, but I don't know how to extract embedded resources from a document.

score 6 · Accepted Answer · answered Nov 24 '13 at 18:57

You need to define your own class that implements Parser and attach it to the ParseContext you supply when parsing the outer document. Your Parser will then be called for every embedded resource, allowing you to save each one out if you want to.

The best example I can think of for this is in the Tika CLI, as used by the -z (extract) flag. If you look in the source code for TikaCLI, you're looking for the FileEmbeddedDocumentExtractor as your example.

The simplest code would be something like:

final AutoDetectParser parser = new AutoDetectParser();

public class ExtractParser extends AbstractParser {
   private int att = 0;
   public Set<MediaType> getSupportedTypes(ParseContext context) {
     // Everything AutoDetect parser does
     return parser.getSupportedTypes(context);
   }
   public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
      // Stream the embedded resource to a new, numbered file
      File f = new File("out-" + (++att) + ".bin");
      try (FileOutputStream fout = new FileOutputStream(f)) {
         IOUtils.copy(stream, fout);
      }
   }
}

InputStream input = new FileInputStream(new File("1.docx"));
Metadata metadata = new Metadata();
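ParseContext context = new ParseContext();
// Register the parser that should be called for each embedded resource
context.set(Parser.class, new ExtractParser());
// handler is any ContentHandler, e.g. the TransformerHandler from the question
parser.parse(input, handler, metadata, context);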

You can also use the EmbeddedDocumentExtractor interface if you'd rather. Whether that or using Parser directly is the better fit depends on what you want to do with the embedded resources.
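Whichever hook you pick, the part that writes each embedded stream to disk can be factored out. Below is a minimal sketch using only the JDK, assuming you want numbered files inside a directory of your choosing. EmbeddedFileSaver and its save method are hypothetical names, not part of Tika. The extension is supplied by the caller; with Tika on the classpath it could come from the detected MIME type (MimeType has a getExtension() method), but that lookup is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not a Tika class): writes each embedded stream to its own numbered file. */
final class EmbeddedFileSaver {
    private final Path outputDir;
    private int count = 0;

    EmbeddedFileSaver(Path outputDir) throws IOException {
        // Create the output directory (and any missing parents) up front
        this.outputDir = Files.createDirectories(outputDir);
    }

    /**
     * Copies the stream to outputDir/out-N plus the given extension and returns the new path.
     * The extension should include the dot (e.g. ".png"); null or empty falls back to ".bin".
     * The stream is not closed here, so the caller keeps ownership of it.
     */
    Path save(InputStream stream, String extension) throws IOException {
        String ext = (extension == null || extension.isEmpty()) ? ".bin" : extension;
        Path target = outputDir.resolve("out-" + (++count) + ext);
        // Files.copy throws FileAlreadyExistsException rather than overwriting
        Files.copy(stream, target);
        return target;
    }
}
```

Inside ExtractParser.parse, the file-writing lines could then be replaced by a single call such as saver.save(stream, ".bin"), with saver created once alongside the parser.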

answered Nov 24 '13 at 18:57

Gagravarr


And another question: in the parse method, is it possible to find out the format of the embedded file being output (as declared in the supported types) and use the correct extension instead of .bin? – Mohamad Ghafourian Nov 24 '13 at 19:57
You can look up the suggested extension from the MimeTypes registry – Gagravarr Nov 24 '13 at 22:36
I wonder why Tika does not support this functionality in a Parser, only in the CLI – OhadR Apr 02 '18 at 11:18
@OhadR Tika does support it from the CLI! Just use code similar to the above, customised for your own requirements! – Gagravarr Apr 02 '18 at 12:50
I meant that if I use Tika as a dependency in my code, not from the CLI, I do not have access to the "FileEmbeddedDocumentExtractor" – OhadR Apr 02 '18 at 14:06
@OhadR You don't need it! Just use the code from my answer and you're set – Gagravarr Apr 02 '18 at 20:25
But I do want the functionality of FileEmbeddedDocumentExtractor, which puts all extracted items into a different directory... this is why I suggest it should be moved from the CLI to tika-core... – OhadR Apr 03 '18 at 06:10
