14

Is there any way I can convert HTML into proper plain text? I tried everything from `raw` to `sanitize`, and even the Mail gem with its `text_part` method, which is supposed to do exactly that but doesn't work for me.

My best shot so far was `strip_tags(strip_links(resource.body))`, but `<p>`, `<ul>`, etc. were not converted correctly.

This is more or less what I have in HTML:

Hello

This is some text. Blah blah blah.

Address:
John Doe
10 ABC Street
Whatever City

New Features
- Feature A
- Feature B
- Feature C
Check this out: http://www.google.com

Best,
Admin

which converts to something like

Hello
This is some text. Blah blah blah.
Address: John Doe 10 ABC Street Whatever City

New Features Feature A Feature B Feature C
Check this out: http://www.google.com

Best, Admin

Any idea?

Cojones
  • 3
    try this: `require 'rubygems'; require 'nokogiri'; puts Nokogiri::HTML(my_html).text` – Amar Sep 18 '13 at 08:25
  • Unfortunately the same result, however I found a solution. Will post here soon! – Cojones Sep 18 '13 at 08:41
  • 1
    Possible duplicate of [HTML to Plain Text with Ruby?](http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby) – mb21 Jan 04 '17 at 13:46

2 Answers

22

Rails (4.2.1 at the time of writing) has `strip_tags`, a built-in helper method for stripping HTML tags.

Some examples:

strip_tags("Strip <i>these</i> tags!")

=> Strip these tags!

strip_tags("<b>Bold</b> no more!  <a href='more.html'>See more here</a>...")

=> Bold no more!  See more here...

strip_tags("<div id='top-bar'>Welcome to my website!</div>")

=> Welcome to my website!

Check it out in the API docs.

Drenmi
klaoha06
  • 1
    In order to test the above samples in a console, you'll have to include the helper by issuing the following command in a console: `include ActionView::Helpers::SanitizeHelper` – Tass Jan 07 '16 at 18:47
  • 3
    This does not take care of things like `&nbsp;` that most WYSIWYG editors seem to use. – tolgap Mar 15 '16 at 15:52
  • 1
    You can easily remove `\n` with `split.join(' ')`. Here is what I use outside of a view: `ActionController::Base.helpers.strip_tags(response.body).split.join(' ')` – duleorlovic Aug 30 '18 at 08:47
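The two caveats raised in these comments (leftover non-breaking spaces and stray newlines) can be handled in plain Ruby once the tags are gone. This is a minimal sketch, assuming the input is a string that has already been through `strip_tags`; `tidy_stripped_text` is a hypothetical helper name, not a Rails API.

```ruby
# Hypothetical helper (not part of Rails): tidies text whose tags have
# already been removed, e.g. the return value of strip_tags.
def tidy_stripped_text(text)
  text
    .gsub(/&nbsp;|\u00A0/, ' ') # entity or literal non-breaking space -> plain space
    .split                      # split on runs of whitespace, including "\n"
    .join(' ')                  # rejoin the words with single spaces
end

puts tidy_stripped_text("Hello&nbsp;world,\n  this   is\n\nplain text.")
# prints: Hello world, this is plain text.
```

Note that this flattens everything onto one line, so paragraph and list breaks are lost as well, which is the behaviour the question is trying to avoid.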
9

Found the solution here: https://github.com/alexdunae/premailer/blob/master/lib/premailer/html_to_plain_text.rb

Works like a charm!
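For readers who only want the general idea, here is a much-simplified sketch of a regex-based conversion that keeps line breaks, paragraphs and list items. It is not premailer's code: `html_to_plain_text`, the `ENTITIES` table and the set of tags handled are assumptions made for illustration, and it will break on malformed or unusual markup.

```ruby
# Hypothetical, simplified converter; NOT premailer's implementation.
# Only a handful of entities are decoded (assumed set, for illustration).
ENTITIES = {
  '&nbsp;' => ' ', '&amp;' => '&', '&lt;' => '<',
  '&gt;' => '>', '&quot;' => '"', '&#39;' => "'"
}.freeze

def html_to_plain_text(html)
  text = html.gsub(/\s+/, ' ')                    # source whitespace is not significant in HTML
  text = text.gsub(/<br\s*\/?>/i, "\n")           # <br> becomes a line break
  text = text.gsub(/<li\b[^>]*>/i, "\n- ")        # each list item starts a "- " line
  text = text.gsub(/<\/t[dh]>/i, ' ')             # keep table cells apart
  text = text.gsub(/<\/tr>/i, "\n")               # one table row per line
  text = text.gsub(/<\/(?:p|div|ul|ol|h[1-6]|table)>/i, "\n\n") # blank line after blocks
  text = text.gsub(/<[^>]+>/, '')                 # drop every remaining tag
  text = text.gsub(Regexp.union(ENTITIES.keys), ENTITIES) # decode the known entities
  text = text.gsub(/ *\n */, "\n")                # trim spaces around line breaks
  text.gsub(/\n{3,}/, "\n\n").strip               # at most one blank line in a row
end

html = <<~HTML
  <p>Hello</p>
  <p>Address:<br>John Doe<br>10 ABC Street</p>
  <h2>New Features</h2>
  <ul>
    <li>Feature A</li>
    <li>Feature B</li>
  </ul>
  <p>Best,<br>Admin</p>
HTML

puts html_to_plain_text(html)
```

With the sample above, the address lines and list items each stay on their own line, and paragraphs are separated by a blank line. For anything beyond simple, well-formed markup, a real HTML parser is the safer choice.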

Cojones
  • Works quite well except if you have tables: no blank separation between cell contents – jogaco Mar 30 '16 at 10:45
  • 1
    note that it parses html with regex H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ – pudiva Feb 25 '20 at 18:33