The task: take an HTML page and keep only its text, with whatever formatting plain text allows. For example, a <br> tag should become \r\n, a table should keep its original structure in the resulting text, and so on.

There is a built-in PHP function, strip_tags(), but it does not fit my requirements: it keeps the contents of styles and scripts, and it loses the formatting because it simply deletes <br>, <table> and other tags.
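
To illustrate, here is a minimal check of that behaviour. The sample markup is made up for this example and is not taken from a real page:

```php
<?php
// Made-up sample markup, for illustration only.
$html = '<style>p { color: red; }</style>'
      . '<p>First line<br>Second line</p>'
      . '<script>alert(1);</script>';

// strip_tags() removes the tags themselves but keeps everything between them.
$plain = strip_tags($html);

// The CSS and JavaScript text survive, and the <br> disappears
// without leaving a line break behind.
echo $plain;
```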

I have also read the Stack Overflow question 'strip html,css from string', but none of its answers does what I'm looking for.

Essentially, I'm looking for a way to render an HTML page to a TXT file (with no links or images). Is it possible? Are there any libraries that do this?

Roman Matveev

1 Answer

One option is to go through Markdown in reverse. There are many HTML-to-Markdown converters, and they do the job you want: they turn the HTML into text while keeping line breaks and similar structure.

One such implementation is html2markdown. It runs on Node.js, and the call is as simple as:

html2markdown("<h1>Hello markdown!</h1>")

At the very least, this strips the tags and gives you text that is easy to clean of Markdown, because only a few marker characters are left, such as #s and ---s.
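
As a rough illustration of that clean-up step, here is a minimal PHP sketch. The function name stripMarkdownMarkers is hypothetical and not part of any library, and it only handles ATX headings, horizontal rules or setext underlines, and bold markers. Real Markdown has more syntax than this covers.

```php
<?php
// Hypothetical helper, not part of any library: removes a few common
// Markdown markers so that only the plain text remains.
function stripMarkdownMarkers(string $markdown): string
{
    $rules = [
        // ATX headings: "### Title" becomes "Title".
        '/^#{1,6}[ \t]+/m' => '',
        // Lines made only of --- or === (rules and setext underlines),
        // removed together with their line ending.
        '/^[ \t]*(?:-{3,}|={3,})[ \t]*(?:\R|\z)/m' => '',
        // Bold: "**text**" becomes "text".
        '/\*\*(.+?)\*\*/s' => '$1',
    ];

    // Patterns are applied in the order listed above.
    $result = preg_replace(array_keys($rules), array_values($rules), $markdown);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $markdown;
}

echo stripMarkdownMarkers("### Quick, to the Batpoles!");
```

Lists, links and inline code would need additional rules of the same kind.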

There is also a PHP implementation of html2markdown on GitHub. The syntax is again simple:

$html = "<h3>Quick, to the Batpoles!</h3>";
$markdown = new HTML_To_Markdown($html);

And this gives you:

echo $markdown; // ==> ### Quick, to the Batpoles!

This library can strip tags too:

$html = '<span>Turnips!</span>';
$markdown = new HTML_To_Markdown($html, array('strip_tags' => true)); // $markdown now contains "Turnips!"    
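
If the Markdown round trip is more than you need, another option is to walk the DOM directly. This is a different technique from the converters above. The sketch below assumes PHP's bundled DOM extension. The names htmlToText, renderNode and renderChildren are hypothetical, and the table layout (tab-separated cells, one row per line) is just one possible choice.

```php
<?php
// Hypothetical helpers, for illustration only. Assumes the markup declares
// its charset; loadHTML() may otherwise misread non-ASCII input.
function htmlToText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $text = renderNode($doc->documentElement);
    $text = preg_replace('/ *\r\n */', "\r\n", $text);        // trim spaces at line edges
    $text = preg_replace('/(\r\n){3,}/', "\r\n\r\n", $text);  // at most one empty line
    return trim($text);
}

function renderChildren(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= renderNode($child);
    }
    return $out;
}

function renderNode(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse whitespace runs, roughly as a browser would.
        return preg_replace('/\s+/', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement)) {
        return '';                       // comments and similar nodes
    }
    switch (strtolower($node->nodeName)) {
        case 'script': case 'style': case 'head':
            return '';                   // drop code, CSS and metadata entirely
        case 'br':
            return "\r\n";
        case 'tr':
            $cells = [];
            foreach ($node->childNodes as $cell) {
                if ($cell instanceof DOMElement) {
                    $cells[] = trim(renderChildren($cell));
                }
            }
            return implode("\t", $cells) . "\r\n";   // one row per line
        case 'p': case 'div': case 'table': case 'li':
        case 'h1': case 'h2': case 'h3': case 'h4': case 'h5': case 'h6':
            return renderChildren($node) . "\r\n";   // block elements end the line
        default:
            return renderChildren($node);            // inline elements: text only
    }
}

echo htmlToText('<p>Hello<br>world</p><table><tr><td>1</td><td>2</td></tr></table>');
```

Links and images contribute only their text content here, which matches the "no links and images" requirement. Column alignment for wider tables would need padding on top of this.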
Praveen Kumar Purushothaman