0

I want to extract the embedded OLE objects from an RTF file. I would prefer to implement this in Java, so I read the documentation and source code of the Apache Tika RTFParser (1.25 and 2.0-ALPHA). As far as I can tell, Tika only extracts text for search purposes and cannot return the object data. Perhaps it would be feasible to write some code based on TextExtractor.

I also tried the C# code from this post, but it fails at the check `if (type != 3) // 3 is file, 1 is link` in PackagedObject.Extract.

Can anyone help me figure out the simplest way (least code) to extract the object data from an RTF file? A cross-platform, server-side solution is preferred (Java or .NET Core). Using Word.Application from C# is not an option, because it depends on the Word client and sometimes terminates unexpectedly.

Pritom Sarkar
Ivan Wu

2 Answers

2

If you want to extract the raw bytes with Apache Tika, try the -z command-line option with tika-app or the /unpack endpoint with tika-server. Tika does focus on text and metadata extraction, but it can also be used to extract raw embedded files.

Tim Allison
0

You may find this a useful starting point: https://github.com/joniles/mpxj/blob/master/src/main/java/net/sf/mpxj/mpp/RTFEmbeddedObject.java#L149

This was intended for users of MPXJ to extract objects embedded in RTF notes.

Jon Iles
  • Thanks Jon. I tried the above code, but it fails with length == -1 at `m_data = getData(blocks, length);` in RTFEmbeddedObject.RTFEmbeddedObject. I realize that the core of this question is parsing the embedded object according to its format spec, am I right? Is there any way to know the byte length of the embedded object? – Ivan Wu Feb 18 '21 at 01:20
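
On the length question, here is a minimal sketch using only the JDK. It assumes the `\objdata` group contains nothing but hex digits and whitespace, and that the decoded bytes follow the OLE 1.0 embedded-object layout as I understand it from the MS-OLEDS documentation: OLEVersion, FormatID (2 for embedded), three length-prefixed ANSI strings (ClassName, TopicName, ItemName), then a 4-byte little-endian NativeDataSize followed by the native data. The class and method names (`RtfObjData`, `decodeFirstObjData`, `readNativeData`) are made up for illustration; none of this comes from Tika or MPXJ, and real files may use `\bin` or nested groups that this does not handle.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class RtfObjData {

    /** Hex-decodes the text after the first \objdata control word, up to the closing brace. */
    static byte[] decodeFirstObjData(String rtf) {
        String marker = "\\objdata";
        int start = rtf.indexOf(marker);
        if (start < 0) {
            throw new IllegalArgumentException("no \\objdata group found");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        for (int i = start + marker.length(); i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '}') {
                break; // end of the \objdata group (assumes no nested groups)
            }
            int digit = Character.digit(c, 16);
            if (digit < 0) {
                continue; // skip spaces and line breaks between hex digits
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        return out.toByteArray();
    }

    /** Reads a 4-byte length followed by that many ANSI bytes (assumed NUL-terminated). */
    static String readLengthPrefixedAnsi(ByteBuffer buf) {
        int len = buf.getInt();
        if (len < 0 || len > buf.remaining()) {
            throw new IllegalStateException("implausible string length: " + len);
        }
        byte[] raw = new byte[len];
        buf.get(raw);
        int textLen = (len > 0 && raw[len - 1] == 0) ? len - 1 : len;
        return new String(raw, 0, textLen, StandardCharsets.ISO_8859_1);
    }

    /** Skips the assumed OLE 1.0 header and returns the NativeData bytes. */
    static byte[] readNativeData(byte[] objData) {
        ByteBuffer buf = ByteBuffer.wrap(objData).order(ByteOrder.LITTLE_ENDIAN);
        buf.getInt();                 // OLEVersion, ignored here
        int formatId = buf.getInt();  // assumed: 2 = embedded, 1 = linked
        if (formatId != 2) {
            throw new IllegalStateException("not an embedded object, FormatID=" + formatId);
        }
        readLengthPrefixedAnsi(buf);  // ClassName, e.g. "Package"
        readLengthPrefixedAnsi(buf);  // TopicName
        readLengthPrefixedAnsi(buf);  // ItemName
        int nativeSize = buf.getInt(); // byte length of the embedded payload
        if (nativeSize < 0 || nativeSize > buf.remaining()) {
            throw new IllegalStateException("implausible NativeDataSize: " + nativeSize);
        }
        byte[] data = new byte[nativeSize];
        buf.get(data);
        return data;
    }
}
```

Calling `RtfObjData.readNativeData(RtfObjData.decodeFirstObjData(rtfText))` would return the payload for the first object only. What that payload contains still depends on the class name: for example, a `Package` object wraps the file in its own header, and other classes may store a compound file, so a further format-specific step is usually needed.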