13

It's been some time now that I am trying to get myself to write a parser in Javascript for org-mode. I had no trouble at all parsing the outline (which I did in a few minutes), but parsing the actual content is far more difficult, and I'm having trouble with imbricated lists, for example.

* This is a heading
  P1 Start a paragraph here but since it is the first indentation level
the paragraph may have a lower indentation on the next line
    or a greater one for that matter.

  + LI1.1 I am beginning a list here
  + LI1.2 Here begins another list item
    which continues here
      and also here
  P2 but is broken here (this line becomes a paragraph
  outside of the first list).
  + LI2.1 P1 Second list item.
    - LI2.1.1 Inner list with a simple item
    - LI2.1.2 P1 and with an item containing several paragraphs.
      Here is the second line in the item, and now

      LI2.1.2 P2 I begin a new paragraph still in the same item. 
        The indentation can be only higher
    LI2.1 P2 but if the indentation is lower, it breaks the item, 
    (and the whole list), and this is a paragraph in the LI2.1
    list item

    - LI 2.2.1 You get the picture
  P3 Just plain text outside of the list.

(In the above example, the PX and LIX.Y are only there to show explicitly the beginning of new blocks, they would not be present in the actual document. P stand for paragraph and LI for list item. In the HTML world, PX would be the beginning of a <p> tag. The numbering are just to help keep track of the nesting and changes of list.)

I wondered about the strategy to parse this kind of significant white-space imbricated blocks, clearly I can parse line by line without any backtracking or nothing, so it must be quite simple, but for some reason I couldn't manage to do it. I tried to get inspiration from Markdown parsers, or such things that are supposed to have similar imbrication features but they appeared to me (for the ones I saw) to be very hacky, full of regexes and I hoped I could write something cleaner (org-mode "grammar" being quite huge when you come to think about it, it will grow little by little and I'd like the whole thing to be maintainable and allow to plug-in new features easily).

Can anyone with experience in parsing such things can give me some pointers?

glmxndr
  • 45,516
  • 29
  • 93
  • 118
  • AFAIK, there's no simple way to parse this. These Wiki-like formatting is just a pain in the @$$ to handle. Are you going to hand-write a parser, or are you writing/translating a grammar and let a parser generator create a parser for you? – Bart Kiers Jun 18 '11 at 14:26
  • Well, your comment suggests that I hand-code it, which I already began to do but failed to find the proper way. Maybe writing a grammar would be easier, but I wouldn't know how to deal with the significant whitespaces. I'm new to parsing, so I have bumped into a wall on everything I have tried so far. :) – glmxndr Jun 18 '11 at 16:19
  • Is there a formal grammar written out somewhere? The problem seems to me that you don't have a token to end a statement. Several languages use white-space formatting instead of semi-colons and braces, but I can't think of any that let you have formatting like your example for P1 with any degree of indentation following the first line. – Samsdram Jun 19 '11 at 23:29
  • @Samsdran : there is no grammar that I know of... – glmxndr Jun 20 '11 at 07:48

3 Answers3

10

I like parsers and compiler theory, so I have written a small parser (by hand) that is able to parse your example snippet into a XML DOM Document object. It shoul be possible to modify it so that it produces an other type of tree structure, like a custom AST (abstract syntax tree).

I've tried to keep the code easy to read, so that you can see how such a parser works.

Ask me if you need some more explanations, or want me to modify it a little.

With your example snippet as input, the statement result = new OrgModParser().parse(input); result.xml returned:

<org-mode-document indentLevel="-1">
    <section indentLevel="0">
        <header indentLevel="0">This is a heading</header>
            <paragraph indentLevel="1">P1 Start a paragraph here but since it is the first indentation level the paragraph may have a lower indentation on the next line or a greater one for that matter.</paragraph>
            <list indentLevel="1">
                <list-item indentLevel="1">
                    <paragraph indentLevel="2">LI1.1 I am beginning a list here</paragraph>
                </list-item>
                <list-item indentLevel="1">
                    <paragraph indentLevel="2">LI1.2 Here begins another list item which continues here and also here</paragraph>
                </list-item>
            </list>
        <paragraph indentLevel="1">P2 but is broken here (this line becomes a paragraph outside of the first list).</paragraph>
        <list indentLevel="1">
            <list-item indentLevel="1">
                <paragraph indentLevel="2">LI2.1 P1 Second list item.</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.1 Inner list with a simple item</paragraph>
                    </list-item>
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.2 P1 and with an item containing several paragraphs. Here is the second line in the item, and now</paragraph>
                        <paragraph indentLevel="3">LI2.1.2 P2 I begin a new paragraph still in the same item.  The indentation can be only higher</paragraph>
                    </list-item>
                </list>
                <paragraph indentLevel="2">LI2.1 P2 but if the indentation is lower, it breaks the item,  (and the whole list), and this is a paragraph in the LI2.1 list item</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.2.1 You get the picture</paragraph>
                    </list-item>
                </list>
            </list-item>
        </list>
        <paragraph indentLevel="1">P3 Just plain text outside of the list.</paragraph>
    </section>
</org-mode-document>

The code:

/*
 * File: orgmodparser.js
 * Basic usage: var object = new OrgModeParser().parse(input); 
 * Works on: JScript and JScript.Net.
 * - For other JavaScript platforms, just replace or override the .createRoot() method
 */

OrgModeParser = function (options) {
    if (typeof options == "object") {
        for (var i in options) {
            this[i] = options[i];
        }
    }
}

OrgModeParser.prototype = {

    "INDENT_WIDTH" :    2,  // Two spaces
    "LINE_SEPARATOR" :  "\r\n",

    /*
     * Each line in the input will be matched against this regexp.
     * Only spaces are allowed as indentation characters.
     * The symbols '*', '+' and '-' will be recognized, but only if they are followed by at least one space.
     * Add other symbols in this regexp if you want the parser to recognize them
     */
    "re" :    /^( *)([\+\-\*] +)?(.*)/,

    // This function must return a valid XML DOM document object
    createRoot :    function () {
        var err, progIDs = ["Msxml2.DOMDocument.6.0", "Msxml2.DOMDocument.5.0", "Msxml2.DOMDocument.4.0", "Msxml2.DOMDocument.3.0", "Msxml2.DOMDocument.2.0", "Msxml2.DOMDocument.1.0", "Msxml2.DOMDocument"];
        for (var i = 0; i < progIDs.length; i++) {
            try {
                return new ActiveXObject(progIDs[i]);
            }
            catch (err) {
            }
        }
        alert("Org-mode parser - Error - Failed to instantiate root object");
        return null;
    },

    parse : function (text) {

        function createNode (tagName, text) {
            var node = root.createElement(tagName);
            node.setAttribute("indentLevel", level);
            if (text) {
                var textNode = root.createTextNode(text);
                node.appendChild(textNode);
            }
            return node;
        }

        function getContainer () {
            if (lastNode.tagName == "section") { return lastNode; }
            var anc = lastNode.parentNode;
            while (anc) {
                if (modifier == "+" || modifier == "-") {
                    if (anc.getAttribute("indentLevel") == level && anc.tagName == "list") { return anc; }
                }
                if (anc.getAttribute("indentLevel") < level && anc.tagName != "paragraph") { return anc; }
                anc = anc.parentNode;
            }
            alert("Org-mode parser - Internal error at line: "+i);return null;
        }

        if (typeof text != "string") { alert("Org-mode - Type error - Input must be of type 'string'"); return null; }

        var body;
        var content;     // The text of the current line, without its indentation and modifier
        var lastNode;    // The node being processed
        var indent;      // The indentation of the current line
        var isAfterDubbleLineBreak;  // Indicates if the current line follows a dubble line break
        var line;        // The current line being processed
        var level;       // The current indentation level; given by indent.length / this.INDENT_WIDTH. Not to confuse with the nesting level 
        var lines;       // Array. Empty lines are included.
        var match;
        var modifier;    // This can be "*", "+", "-" or ""
        var root;

        isAfterDubbleLineBreak = false;
        level = -1;      // Indentation level is -1 initially; it will be 0 for the first "*"-bloc
        lines = text.split(this.LINE_SEPARATOR);
        root = this.createRoot();
        body = root.appendChild(createNode("org-mode-document"));
        lastNode = body;

        for (var i = 0; i < lines .length; i++) {
            line = lines[i];
            match = line.match(this.re);
            if (match === null) { alert("org-mode parse error at line: " + i); return null; }
            indent = match[1];
            level = indent.length / this.INDENT_WIDTH;
            modifier = match[2] && match[2].charAt(0);
            content = match[3];

            // These conditions tell the parser what to do when encountering a line with a given modifer
            if (content === "") { dubbleLineBreak(); continue; }
            else if (modifier == "+" || modifier == "-") { plus(); }
            else if (modifier == "*") { star(); }
            else if (modifier == "+") { plus(); }
            else if (modifier == "-") { minus(); }
            else if (modifier == "") { noModifier(); }
            isAfterDubbleLineBreak = false;
        }
        return root;


        function star() {
            // The '*' modifier is not allowed on an indented line
            if (indent) { alert("Org-mode parse error: unexpected '*' symbol at line " + i); return null; }
            lastNode = body.appendChild(createNode("section"));
            // The div remains the current node
            lastNode.appendChild(createNode("header", content));
        }

        function plus() {
            var container = getContainer();
            var tn = container.tagName;
            if (tn == "section" || tn == "list-item") {
                lastNode = container.appendChild(createNode("list"));
                lastNode = lastNode.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            } else if (tn == "list") {
                lastNode = container.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            }
            else alert("Org-mode parser - Internal error - Bad container tag name: " + tn);
            lastNode.setAttribute("indentLevel", Number(lastNode.getAttribute("indentLevel")) + 1);
        }

        function minus() { plus(); }

        function noModifier() {
            if (lastNode.tagName == "paragraph" && !isAfterDubbleLineBreak && (lastNode.getAttribute("indentLevel") == 1 || level >= lastNode.getAttribute("indentLevel"))) {
                lastNode.childNodes[0].appendData(" " + content);
            } else {
                var container = getContainer();
                lastNode = container.appendChild(createNode("paragraph", content));
            }
        }

        function dubbleLineBreak() {
            while (lines[i+1] && /^\s*$/.test(lines[i+1])) { i++; }
            isAfterDubbleLineBreak = true;
        }

    }
};
Luc125
  • 5,752
  • 34
  • 35
  • Not bad, that's a bit better that where I was (since it (almost) works as I expect :) ), but I was quite reluctant to rely on the DOM, first because I expect to use the parser outside of a browser, and second because some other features of org-mode can't be inserted in the DOM model as is. (I said almost as I expect, since the LI2.1.2 should have two paragraphs : praragraphs are broken on a double-new-line.) – glmxndr Jun 23 '11 at 10:45
  • Thank you for your feedback. I've corrected the dubble new line break bug for paragraphs and changed the output type to XML object so that it doesn't need a browser to work. – Luc125 Jun 23 '11 at 15:13
  • Thanks. You'll find me to be quite annoying, but relying on Windows ActiveX is not much better than relying on browsers... :) I get the gist of your proposal though. – glmxndr Jun 23 '11 at 15:33
  • 1
    No, I don't find you to be annoying... The code I posted is not a full solution working on every js platform, just an example of how to deal with imbricated blocks and whitespaces, without using many regexes. That's why I used a DOM or XML object, it allowed me to write less methods, which makes the code shorter and easier to understand. Anyway it's easy to port to any platform that has support for xml. But if you need a fully portable parser it will use a pure JS tree structure instead of a XML object. If you'd like me to write it, or post explanations about the algorithm, tell me. – Luc125 Jun 24 '11 at 09:35
  • I awarded the bounty to this answer for the given effort, which I truly appreciate. The problem I have with this answer is that it resembles a lot to where I was, and adding more features to the parser might become very messy. But as Bart states that a grammar has to be messy anyway for this kind of language, I take the answer to my question to be : "well, it's complicated, try more". :) Thanks a lot to both Luc and Bart. – glmxndr Jun 25 '11 at 07:04
7

Like I said in the comments, it'll be a pain to parse this, as are many of such Wiki-like languages.

If you're going to write a grammar and let a parser generator create a parser for you, opposed to hand-writing a parser, there are various options. To list a few:

I know ANTLR can do this, but it won't be trivial, and on top of that, you'll need to get to grips with the tool (which will take a bit of time!). I haven't spent too much time on the other two tools, but doubt they will be up to the job with such a nasty language.

Going for a hand-written parser will give you a quick start, but debugging, enhancing or rewriting it will be difficult. Writing a grammar and letting a parser generator create the parser for you will result in easier debugging, enhancing and rewriting of the parser (through the grammar), but you will need to spend (quite) some time learning to use the tool.

A hand-written parser will (most likely) be quicker than a generated one, if written properly, of course. But the difference between them might only be noticeable with large chunks of source.

Sorry, I don't have a general strategy how to handle this with a hand-written parser.

Best of luck!

Community
  • 1
  • 1
Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • +1 for jison. You just need to use a good old lex/yacc port to do the parsing. – Raynos Jun 19 '11 at 23:06
  • @Raynos, ok, but this does not exactly answer how to deal with significant white-space, especially when LI items have this peculiarity that they have a different indentation for their first line and the subsequent lines (see example in the question). – glmxndr Jun 20 '11 at 06:22
  • @subtenante [Buy the dragon book](http://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools). Then read it. Then solve your problem with your amazing compiler knowledge. – Raynos Jun 20 '11 at 06:40
  • 2
    @Raynos Why type so many characters? You could just have said "Just solve your problem". :p – glmxndr Jun 20 '11 at 07:15
  • 1
    @subtenante, if you're interested, I could post a small demo of how to parse this using ANTLR (and the JavaScript target, of course). By small, I meant _not_ the full Org-Mode, but the little example snippet you posted. However, if you already know you're _not_ going to use ANTLR, I won't bother. Just let me know, and I'll try to cook something up this evening (I'm in CET). – Bart Kiers Jun 20 '11 at 07:25
  • @Bart yes, I'm interested. The example I gave is really the basics, I expect to have tons of other difficulties later, which I won't bother you with. :) But, yes, if you could save me the trouble of reading the dragon book to get me started, I'd appreciate a lot. – glmxndr Jun 20 '11 at 07:42
  • @subtenante, sorry, I'm can't do it tonight: I tried to hack up something quickly, but my JavaScript-voodoo is pretty rusty... I _will_ post a demo, but I'm not sure when I have the time to do so. – Bart Kiers Jun 20 '11 at 20:23
  • @Bart, no worries, it can wait. Thanks for taking the time anyway. :) – glmxndr Jun 20 '11 at 20:52
  • @subtenante, I have an ANTLR grammar in Java that parses this, but it uses quite a bit embedded Java code which I can't get ported to JavaScript. I won't post the grammar because it is a proper mess which would take me quite a while to explain. I knew it would be messy, but hadn't expected the mess I now see in front of me! :) – Bart Kiers Jun 24 '11 at 12:00
  • Oh, and I just saw you attached a bounty to your question which ends in two days. If no one else provides a better answer in that time, be sure to award it to Luc1245 otherwise it might automatically be awarded to me because my answer has more up-votes. – Bart Kiers Jun 24 '11 at 12:00
  • @Bart : I'm really interested in seeing the grammar anyway. Maybe you could send it to my gmail account ? (Same account name as my name here.) I'm proficient with Java so there's no problem about that. – glmxndr Jun 25 '11 at 06:58
3

There is a Javascript org-mode parser available here.

mac
  • 9,885
  • 4
  • 36
  • 51