Is there any module that can parse restructuredtext into a tree model?
Can docutils or sphinx do this?
I'd like to extend the answer from Gareth Latty. "What you probably want is the parser at docutils.parsers.rst" is a good starting point, but what comes next? Namely: how do you actually parse reStructuredText in Python?
Below is a complete answer for Python 3.6 and docutils 0.14:
import docutils.nodes
import docutils.parsers.rst
import docutils.utils
import docutils.frontend

def parse_rst(text: str) -> docutils.nodes.document:
    parser = docutils.parsers.rst.Parser()
    components = (docutils.parsers.rst.Parser,)
    settings = docutils.frontend.OptionParser(components=components).get_default_values()
    document = docutils.utils.new_document('<rst-doc>', settings=settings)
    parser.parse(text, document)
    return document
The resulting document can then be processed with a visitor, for example the one below, which prints all references in the document:
class MyVisitor(docutils.nodes.NodeVisitor):

    def visit_reference(self, node: docutils.nodes.reference) -> None:
        """Called for "reference" nodes."""
        print(node)

    def unknown_visit(self, node: docutils.nodes.Node) -> None:
        """Called for all other node types."""
        pass
Here's how to run it:
doc = parse_rst('spam spam lovely spam')
visitor = MyVisitor(doc)
doc.walk(visitor)
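Note that the sample text above contains no hyperlinks, so the visitor prints nothing for it. Here is a minimal sketch of the same visitor pattern on input that does contain a reference. The class name `ReferenceCollector` and the sample text are made up for illustration, and I use `docutils.core.publish_doctree` only as a shortcut to keep the snippet self-contained; it should behave the same on a document returned by `parse_rst`.

```python
from docutils import nodes
from docutils.core import publish_doctree


class ReferenceCollector(nodes.NodeVisitor):
    """Hypothetical visitor that records the target URI of each reference."""

    def __init__(self, document):
        super().__init__(document)
        self.uris = []

    def visit_reference(self, node):
        # References with an embedded URI carry it in the 'refuri' attribute.
        uri = node.get('refuri')
        if uri is not None:
            self.uris.append(uri)

    def unknown_visit(self, node):
        # Ignore every other node type.
        pass


source = 'See `Docutils <https://docutils.sourceforge.io/>`_ for details.'
doc = publish_doctree(source)
collector = ReferenceCollector(doc)
doc.walk(collector)
print(collector.uris)
```

Collecting into a list instead of printing inside the visitor makes the result easier to reuse or test.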
Docutils does indeed contain the tools to do this.
What you probably want is the parser at docutils.parsers.rst.
See this page for details on what is involved. There are also some examples at docutils/examples.py; in particular, check out the internals() function, which is probably of interest.
Based on Gareth Latty's and mbdevpl's answers, here is an update for newer versions of docutils.
Starting with docutils 0.18 (2021-10-26), docutils.frontend.OptionParser has been deprecated (git mirror commit, upstream SVN HISTORY.txt), and the following warning is printed (source):
DeprecationWarning: The frontend.OptionParser class will be replaced by a subclass of argparse.ArgumentParser in Docutils 0.21 or later.
The docutils.frontend.get_default_settings() function can be used instead, but it was only added in docutils 0.18, so to stay compatible with all versions without triggering the warning, you can use:
import docutils.nodes
import docutils.parsers.rst
import docutils.utils
import docutils.frontend

def parse_rst(text: str) -> docutils.nodes.document:
    parser = docutils.parsers.rst.Parser()
    if hasattr(docutils.frontend, 'get_default_settings'):
        # docutils >= 0.18
        settings = docutils.frontend.get_default_settings(docutils.parsers.rst.Parser)
    else:
        # docutils < 0.18
        settings = docutils.frontend.OptionParser(components=(docutils.parsers.rst.Parser,)).get_default_values()
    document = docutils.utils.new_document('<rst-doc>', settings=settings)
    parser.parse(text, document)
    return document
The rest of the code stays the same and can be found in mbdevpl's answer.
There is a higher-level interface to Docutils in the docutils.core module.
To parse a string of reStructuredText into a document tree, you can do, e.g.:
from docutils.core import publish_doctree
source = 'Hello *world*'
tree = publish_doctree(source)
For details, see https://docutils.sourceforge.io/docs/api/publisher.html
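To see what the returned tree looks like, here is a small sketch that walks it recursively and prints one node type per line. The helper name `outline` is my own and not part of Docutils; it relies only on the `tagname` and `children` attributes that Docutils nodes provide.

```python
from docutils.core import publish_doctree


def outline(node, depth=0):
    """Return (depth, tagname) pairs for a node and all of its descendants."""
    rows = [(depth, node.tagname)]
    for child in node.children:
        rows.extend(outline(child, depth + 1))
    return rows


tree = publish_doctree('Hello *world*')
rows = outline(tree)
for depth, name in rows:
    # Indent each node type by its depth in the tree.
    print('  ' * depth + name)
```

For a quick look without writing a helper, `print(tree.pformat())` gives an indented pseudo-XML dump of the same tree.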