6

I've just got my hands on a Stack Overflow data dump, and I'm disappointed to see that the Body field of each post is in HTML rather than Markdown. I suspect the original database stores Markdown, because that's what I see when I edit an answer.

I want to recover Markdown from a large set of answers. I will be processing hundreds of entries in batch mode, using either command-line tools or some kind of Lua or C library, so an interactive tool like the wmd Markdown editor is not suitable. What tools are available to help me recover Markdown from a Stack Overflow data dump?


(Related question, not a duplicate: Convert HTML back to Markdown within wmd.)

Norman Ramsey

2 Answers

5

Markdownify converts HTML to Markdown.

See Also: MetaSO / Can Markdown be recovered from the SO data dump?

Sampson
  • When it comes to using PHP on the command line, I am a troglodyte. I can't seem to figure out from the manual if there is a library function to read the entire contents of a file. Is dio_read(STDIN) on the right track? – Norman Ramsey Aug 21 '09 at 02:58
  • If you want to read the contents of a file, there are many ways; a simple function that does it is `file_get_contents()`. – Sampson Aug 21 '09 at 11:09
2

Take a look at pandoc: http://johnmacfarlane.net/pandoc/

There is an html2markdown tool included with pandoc that works pretty well, and because the program runs from the command line, batch conversion is straightforward.

Here is the man page: http://johnmacfarlane.net/pandoc/html2markdown.1.html
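For the batch case, a rough sketch of driving a command-line converter from a script might look like the following. It assumes pandoc is installed and on your PATH, and it calls `pandoc -f html -t markdown` directly rather than the html2markdown wrapper. The names `PANDOC_COMMAND`, `convert_html`, and `convert_all` are made up for illustration, and the sketch assumes you have already pulled the Body strings out of the dump.

```python
import subprocess

# Assumption: pandoc is on PATH. -f and -t select the input and output formats.
PANDOC_COMMAND = ["pandoc", "-f", "html", "-t", "markdown"]


def convert_html(html, command=PANDOC_COMMAND):
    """Pipe one HTML fragment through `command` and return what it prints."""
    result = subprocess.run(
        command,
        input=html,           # sent to the converter on stdin
        capture_output=True,  # collect stdout and stderr
        encoding="utf-8",
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


def convert_all(bodies, command=PANDOC_COMMAND):
    """Convert an iterable of HTML strings, one converter process per entry."""
    return [convert_html(body, command) for body in bodies]
```

Starting one process per post is simple and probably fine for a few hundred entries. The `command` parameter is only there so that a different converter can be swapped in. Expect the output to approximate the original Markdown, not reproduce it exactly.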

Mica