How to extract WYSIWYG text from HTML?

Question

An HTML file is pleasing to the eye and human-readable when rendered in a browser, but it is hell to read as raw markup.

Is it possible to extract text out of an HTML fragment, and convert it to a simple text file, with basic formatting?

I mean a lossy approach: removing CSS, removing superscripts and subscripts, and keeping only as much text and formatting as a human needs to understand the extracted text the way they would understand the original rendered HTML fragment.

P.S. I've tried regular expressions, and I've tried an inclusive approach that selects only a few tags. Both soon proved impractical, as HTML files can get really tricky.

What format do you mean? A basic `.txt` file or something like Markdown? — IiroP, Jun 21 '18 at 07:48
Maybe import it via an RSS tool/plugin and then do something with it? — herrfischer, Jun 21 '18 at 07:49
I mean something like what we can do on Stack Overflow to create formatted text. Yep, I'm talking about Markdown. — mohammad rostami siahgeli, Jun 21 '18 at 07:50
You could use [Html Agility Pack](http://html-agility-pack.net/) to extract the plain text. I imagine that for some documents it would be easy to break the plain text into paragraphs with the `<p>` parts. A `<ul>` could similarly have its `<li>` elements on separate lines starting with a dash, etc. — Andrew Morton, Jun 21 '18 at 07:53
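The idea in the last comment (paragraph breaks at block tags, list items on their own lines starting with a dash) can be sketched in plain JavaScript without any library. This is only a rough illustration for small, well-formed fragments: `htmlToPlainText` is a hypothetical helper name, the tag list is an assumption, and, as the question already notes, regular expressions break down on tricky HTML, so a real parser is the safer choice.

```javascript
// Hypothetical helper: lossy HTML-to-text for small, well-formed fragments only.
function htmlToPlainText(html) {
  return html
    // Drop CSS and scripts together with their content
    .replace(/<(style|script)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // Whitespace in the HTML source is not significant
    .replace(/\s+/g, ' ')
    // Each list item starts on its own line with a dash
    .replace(/<li\b[^>]*>/gi, '\n- ')
    .replace(/<br\b[^>]*>/gi, '\n')
    // A closing block-level tag ends a paragraph
    .replace(/<\/(p|h[1-6]|ul|ol|div)\s*>/gi, '\n\n')
    // Strip every remaining tag (including <sub> and <sup>), keeping its text
    .replace(/<[^>]+>/g, '')
    // Decode a handful of common entities; &amp; goes last to avoid double decoding
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&')
    // Tidy up spaces around line breaks and collapse runs of blank lines
    .replace(/[ \t]+\n/g, '\n')
    .replace(/\n[ \t]+/g, '\n')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

// Made-up sample input for illustration
const sample =
  '<style>p { color: red; }</style>' +
  '<h1>Title</h1>' +
  '<p>Water is H<sub>2</sub>O &amp; more.</p>' +
  '<ul><li>First</li><li>Second</li></ul>';

console.log(htmlToPlainText(sample));
```

The style block disappears, the subscript is flattened into the surrounding text, and each list item ends up on its own dashed line.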

score 1 · Answer 1 · answered Jun 21 '18 at 08:17

One option is the Turndown JS library, which can be used either in Node or in the browser. It converts HTML to Markdown. It also has a demo page where you can test it.

I created a simple example with that library, which shows the output in a textarea and downloads the file (see this answer):

// See https://github.com/domchristie/turndown#usage
var turndownService = new TurndownService();
var markdown = turndownService.turndown(document.getElementById('content'));

// Output to textarea for preview
var textarea = document.getElementById('out');
textarea.value = markdown;

// Download function from https://stackoverflow.com/a/18197341/5845085
function download(filename, text) {
  var element = document.createElement('a');
  element.setAttribute('href', 'data:text/plain;charset=utf-8,' + encodeURIComponent(text));
  element.setAttribute('download', filename);

  element.style.display = 'none';
  document.body.appendChild(element);

  element.click();

  document.body.removeChild(element);
}

// Download the file
download('text.md', markdown);

<div id="content" hidden>
  <h1>Title</h1>
  <p>Text text text text</p>
  <ul>
    <li>Text</li>
    <li>Text</li>
  </ul>
</div>

<textarea id="out" style="width: 80%; height: 200px;"></textarea>

<script src="https://unpkg.com/turndown/dist/turndown.js"></script>
