Python : How to convert markdown formatted text to text

Question

I need to convert Markdown text to plain text so that I can display a summary on my website. I want the code in Python.

Not Python, but you could pass it to pandoc: `pandoc --to=plain` leaves some formatting (header underlines), but not much. — naught101, May 29 '14 at 06:22
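
If calling an external tool is acceptable, the pandoc suggestion can be driven from Python with the standard library alone. This is only a sketch: it assumes a pandoc executable is installed and on PATH, the function name markdown_to_plain_with_pandoc is made up for illustration, and the --from=markdown flag is an assumption about the input dialect.

```python
import shutil
import subprocess


def markdown_to_plain_with_pandoc(text):
    """Pipe Markdown through `pandoc --to=plain` and return its stdout."""
    # Assumption: pandoc is installed separately; it is not a Python package.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        ["pandoc", "--from=markdown", "--to=plain"],
        input=text,            # Markdown source goes in on stdin
        capture_output=True,   # plain text comes back on stdout
        text=True,
        check=True,            # raise CalledProcessError if pandoc fails
    )
    return result.stdout
```

As the comment says, the plain writer keeps a little formatting, so the output may need further cleanup before it is used as a summary.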

score 60 · Answer 1 · answered Feb 28 '19 at 10:44

Despite the fact that this is a very old question, I'd like to suggest a solution I came up with recently. This one neither uses BeautifulSoup nor has an overhead of converting to html and back.

The core class of the markdown module, Markdown, has an attribute output_formats that is not configurable but can be patched, like almost anything in Python. This attribute is a dict mapping each output format name to a rendering function. By default it contains two output formats, 'html' and 'xhtml'. With a little help it can also hold a plain-text rendering function, which is easy to write:

from markdown import Markdown
from io import StringIO


def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# patching Markdown
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False


def unmark(text):
    return __md.convert(text)

The unmark function takes Markdown text as input and returns it with all the Markdown formatting characters stripped out.
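
To see what the rendering function does on its own, the same text/tail walk can be tried on a hand-built element from the standard library's xml.etree.ElementTree, with no Markdown involved. This is a minimal sketch; the sample XML string is invented for illustration.

```python
import xml.etree.ElementTree as ET
from io import StringIO


def unmark_element(element, stream=None):
    """Write an element's text, its children's text, and its tail, in document order."""
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)      # text before the first child
    for sub in element:
        unmark_element(sub, stream)     # recurse into each child
    if element.tail:
        stream.write(element.tail)      # text after the element's closing tag
    return stream.getvalue()


# Hypothetical sample input: a paragraph with two inline elements.
root = ET.fromstring("<p>Some <strong>bold</strong> and <em>italic</em> text.</p>")
plain = unmark_element(root)
print(plain)  # Some bold and italic text.
```

For an element with no tail, this gives the same string as "".join(root.itertext()).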

Looks great, thanks a lot for taking the time to add an answer even though the question is so old already. Much appreciated! — Frerich Raabe, Nov 04 '19 at 11:55
Thank you for this awesome answer. I was going to implement it myself, but this snippet saved me a good deal of time. — Leonardo Maffei, Feb 16 '22 at 17:15
This is definitely preferable to the accepted answer! Thanks. — Hans Z, Mar 10 '22 at 15:11
And there's an unofficial [Python-Markdown](https://github.com/Python-Markdown/markdown) extension, [kostyachum/python-markdown-plain-text](https://github.com/kostyachum/python-markdown-plain-text), that does basically the same thing, without the monkey-patching. — Ross Patterson, Oct 10 '22 at 13:54

score 53 · Accepted Answer · edited Jul 22 '21 at 15:59

The Markdown and BeautifulSoup (now called beautifulsoup4) modules will help do what you describe.

Once you have converted the markdown to HTML, you can use an HTML parser to extract the plain text.

Your code might look something like this:

from bs4 import BeautifulSoup
from markdown import markdown

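# some_markdown_string holds the Markdown source to convert
html = markdown(some_markdown_string)
# Name the parser explicitly to avoid bs4's "no parser specified" warning
text = ''.join(BeautifulSoup(html, 'html.parser').find_all(string=True))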

edited Jul 22 '21 at 15:59

Stefan

answered Apr 17 '09 at 19:27

Jason Coon

This seems to convert to HTML, but I need plain text, like the question summaries on the Stack Overflow homepage, which have the formatting removed. – Krish Apr 17 '09 at 19:30
Thanks coonj. Good to know about BeautifulSoup. – Krish Apr 18 '09 at 01:35
Converting back and forth from Markdown to HTML is too much, there's a good alternative below that sticks to Markdown only. – Renato Byrro Aug 08 '20 at 14:25
Good answer - this is a good solution when saving both the raw text and the md/html version to a database. I have not tested it, but it should work as long as the raw text's newlines are not stripped. – Hills Nov 21 '22 at 08:25
Tested it now but it removed the newlines from the raw text. Do you know how this can be prevented? – Hills Nov 21 '22 at 09:05

score 5 · Answer 3 · answered Oct 28 '20 at 14:43

This is similar to Jason's answer, but handles comments correctly.

import markdown # pip install markdown
from bs4 import BeautifulSoup # pip install beautifulsoup4

def md_to_text(md):
    html = markdown.markdown(md)
    soup = BeautifulSoup(html, features='html.parser')
    return soup.get_text()

def example():
    md = '**A** [B](http://example.com) <!-- C -->'
    text = md_to_text(md)
    print(text)
    # Output: A B

answered Oct 28 '20 at 14:43

Soroush

Instead of BeautifulSoup you can use pypandoc. Run from ipython if the module is not found in Jupyter. – S.Doe_Dude Apr 17 '23 at 18:49

score 2 · Answer 4 · answered Apr 17 '09 at 19:42

Commented and removed it because I finally think I see the rub here: It may be easier to convert your markdown text to HTML and remove HTML from the text. I'm not aware of anything to remove markdown from text effectively but there are many HTML to plain text solutions.
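
One of those HTML-to-plain-text options is the standard library's html.parser, so the second half of that pipeline needs no third-party package. The following is a rough sketch, not a full converter: the class name TextExtractor, the helper html_to_text, and the choice of which tags count as block-level are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; HTML comments are dropped because handle_comment is not overridden."""

    # Assumed set of block-level tags after which a newline is emitted.
    BLOCK_TAGS = {"p", "div", "li", "pre", "blockquote",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # keep block boundaries as line breaks

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<h1>Title</h1><p><strong>A</strong> <a href="http://example.com">B</a> <!-- C --></p>'
print(html_to_text(sample))
```

Emitting a newline at the end of each block element is a deliberate choice here, so that headings and paragraphs do not run together in the output.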

answered Apr 17 '09 at 19:42

Rob

So much for praising Markdown for being "basically plain text." Might as well use Word if it's that hard to strip off. – gargoylebident Aug 20 '21 at 04:05
markdown runs on 99.99% of the computers. – panchicore Nov 18 '22 at 16:59

Neil · Answer 5 · 2023-04-27T05:39:11.873

It's not necessarily a tremendously fast solution in my limited experience, but you might try the MarkdownCorpusReader from NLTK. It requires a directory full of markdown files and a regular expression for valid file IDs.

from nltk.corpus.reader.markdown import MarkdownCorpusReader
from nltk.tokenize.treebank import TreebankWordDetokenizer
# You might also need a punkt detokenizer for the English language.

filepath = './some/path/here'
# The second argument is a regex matching the file IDs to load
reader = MarkdownCorpusReader(filepath, r'[\w]*\.md')

def get_text(reader: MarkdownCorpusReader, fileid: str) -> str:
    tokens = reader.words(fileids=fileid)
    # You might also need a punkt detokenizer for the English language.
    return TreebankWordDetokenizer().detokenize(tokens)

Unfortunately there are variations of Markdown, so depending on where the text comes from, some of the formatting elements may still be present. I can't fully test this because I don't have example data to work on. You might also need a punkt detokenizer for English. I'm not intimately familiar with the default tokenization used here, but I presume it is nltk.tokenize.word_tokenize, which uses a combination of a Treebank tokenizer and an English-language punkt tokenizer.

I'll add that NLTK's markdown reader is built on markdown-it-py and mdit-plain, so presumably those modules also contain tools to help with this.

score 0 · Answer 6 · answered Aug 19 '23 at 08:15

As Neil suggested, nltk's parser is based on markdown-it and mdit-plain. It's quite easy to use those directly (no BeautifulSoup needed!).

pip install markdown-it-py mdit_plain

from markdown_it import MarkdownIt
from mdit_plain.renderer import RendererPlain

parser = MarkdownIt(renderer_cls=RendererPlain)

md_data = "# some markdown"
txt_data = parser.render(md_data)

score -3 · Answer 7 · edited Jan 12 '22 at 08:34

I came here while searching for a way to create so-called GitLab Releases via an API call. I hope this matches the use case of the original questioner.

I decoded markdown to plain text (including whitespace such as \n) this way:

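    def read_release_note(path="release_note.md"):  # illustrative function name
        # Note: this only round-trips the text through UTF-8; Markdown syntax is left in place.
        with open(path, 'r') as file:
            release_note = file.read()
        description = bytes(release_note, 'utf-8')
        return description.decode("utf-8")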
