10

Is there a PHP or Ruby library, or a web service, that can programmatically extract information from Microsoft OneNote documents?

The solution is to be implemented in a web application backend.

I am not looking for Windows-specific solutions, nor for solutions that require users to download application extensions or install software.

Till
lorefnon
  • The closest thing is this request https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 in the Apache POI project. You could ask there whether it will be implemented someday. Then you could use it in Java, or in JRuby via Tika. – brutuscat Sep 07 '12 at 09:06
  • Can you possibly share an example file? I had no idea what OneNote is, but from reading the WP entry (I added the link to it), it sounds similar to OpenOffice's format. – Till Sep 11 '12 at 17:54

3 Answers

4

Here's a cross-platform OneNote parser (.one -> .html). It's pretty primitive, but it's open source and may get you going with parsing the file format:

https://github.com/dropbox/onenote-parser

Feel free to use it (Apache License).

hellcatv
  • This is **exactly** what I was looking for ... 6 years ago. But, nonetheless, good to know something like this exists. – lorefnon Mar 11 '17 at 09:48
2

Easy solution

You could easily write your own extractor utility in C# using the Microsoft.Office.Interop.OneNote API.

You can find a detailed walkthrough in this MSDN article. You could then access the content with code similar to this:

using System;
using System.Linq;
using System.Xml.Linq;
using Microsoft.Office.Interop.OneNote;

class Program
{
  static void Main(string[] args)
  {
    // Connect to the locally installed OneNote application via COM interop.
    var onenoteApp = new Application();

    // Fetch the hierarchy of all notebooks, down to page level, as XML.
    string notebookXml;
    onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);

    var doc = XDocument.Parse(notebookXml);
    var ns = doc.Root.Name.Namespace;

    // Find the first page named "Test page"; the cast is null-safe if "name" is missing.
    var pageNode = doc.Descendants(ns + "Page")
      .FirstOrDefault(n => (string)n.Attribute("name") == "Test page");
    if (pageNode != null)
    {
      // Retrieve and print the full XML content of that page.
      string pageXml;
      onenoteApp.GetPageContent(pageNode.Attribute("ID").Value, out pageXml);
      Console.WriteLine(XDocument.Parse(pageXml));
    }
  }
}

You can read the API documentation here, which also contains a few examples.
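If the page XML printed by the program above is handed over to a non-Windows backend, it could be consumed with an ordinary XML parser. The following is only a rough Ruby sketch: the helper name `extract_page_text` and the sample document are hypothetical, the namespace URI assumes the OneNote 2013 schema (other versions use a different year), and the tag-stripping regex assumes that text runs carry simple inline HTML markup.

```ruby
require "rexml/document"

# Assumed namespace of OneNote 2013 page XML; adjust for other versions.
ONE_NS = "http://schemas.microsoft.com/office/onenote/2013/onenote"

# Hypothetical helper: returns the plain text of every <one:T> element.
def extract_page_text(page_xml)
  doc = REXML::Document.new(page_xml)
  runs = REXML::XPath.match(doc, "//one:T", "one" => ONE_NS)
  # Each text run is CDATA that may contain inline HTML; strip the tags.
  runs.map { |t| t.text.to_s.gsub(/<[^>]+>/, "").strip }.reject(&:empty?)
end

# Hand-written sample that mimics the shape of page XML (an assumption,
# not real output captured from OneNote).
SAMPLE = <<~XML
  <one:Page xmlns:one="#{ONE_NS}" name="Test page">
    <one:Title><one:OE><one:T><![CDATA[Test page]]></one:T></one:OE></one:Title>
    <one:Outline><one:OEChildren>
      <one:OE><one:T><![CDATA[First <span style="font-weight:bold">note</span>]]></one:T></one:OE>
    </one:OEChildren></one:Outline>
  </one:Page>
XML

puts extract_page_text(SAMPLE)
```

REXML is used here only because it ships with standard Ruby installations; any other XML parser with namespace-aware XPath support would work the same way.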

Low level approach

If your environment does not allow you to use this official library, I don't know of a Unix port, but an Office document is stored in XML format, so you only need an XML parser to extract the information you need. Here you have the OneNote format specification (there is a PDF link to the latest update at the top). You can then use the parser of your choice to create your own little utility. My suggestion for Ruby would be libxml.

I hope this suits your needs.

chipairon
  • While I appreciate the effort you took, I already specified very clearly that I am not looking for Windows-specific solutions. – lorefnon Nov 10 '12 at 15:52
  • As you stated that a service would do, you should be able to create a web service on Windows that uses the already implemented, official and tested library from Microsoft, and then consume this service from your "non-Windows-specific" application. That would be the easier solution, and you would be done in a day. – chipairon Nov 12 '12 at 10:53
0

Your best bet is to learn XML parsing in PHP or Ruby and analyse OneNote documents to figure out how they're structured. Once you have figured out the .one files, you can use PHP to extract the required information from them. Check this link out; it might help you.

Hassan Khan