30

Is there anything out there to convert html to plain text (maybe a nokogiri script)? Something that would keep the line breaks, but that's about it.

If I write something in Google Docs, like this, and run that command, it outputs (after removing the CSS and JavaScript) this:

\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!

So the formatting's all messed up. I'm sure someone has solved the details like these somewhere out there.

Lance

9 Answers

67

Actually, this is much simpler:

require 'rubygems'
require 'nokogiri'

puts Nokogiri::HTML(my_html).text

You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.

Matchu
  • 3
    True, but that way you do not get rid of the – ema Sep 17 '11 at 15:46
16

You could start with something like this:

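require 'open-uri'
require 'rubygems'
require 'nokogiri'

uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
# URI.open replaces the bare open(uri), which no longer fetches URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open(uri))
# Drop script and stylesheet-link nodes so their contents do not leak into the text.
doc.css('script, link').each { |node| node.remove }
# Collapse runs of spaces and runs of newlines down to a single character each.
puts doc.css('body').text.squeeze(" \n")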
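To see what the whitespace cleanup does without fetching a page, here is a small plain-Ruby sketch. The helper names `squeeze_whitespace` and `strip_each_line` and the sample string are hypothetical; the second helper is the per-line strip suggested in the comments on this answer.

```ruby
# Collapse each run of spaces, and each run of newlines, to a single character.
def squeeze_whitespace(text)
  text.squeeze(" \n")
end

# Strip leading and trailing whitespace from every line.
def strip_each_line(text)
  text.split("\n").collect { |line| line.strip }.join("\n")
end

# Hypothetical sample standing in for doc.css('body').text.
sample = "  Heading  \n\n\n   Some    text \n"
puts strip_each_line(squeeze_whitespace(sample))
```

Note that `squeeze` leaves a single leading space on indented lines, which is why the per-line strip is a useful second pass.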
artm
Levi
  • If you care to clean up a bit more, replace the last line with this: puts doc.css('body').text.split("\n").collect { |line| line.strip }.join("\n") – Levi Mar 24 '10 at 03:50
  • this is really close! I ran this on http://docs.google.com/View?id=dffk85xk_63f29hv2hn and it's almost perfect. It's just not including the first new line ( or tags around content). Is there a way to include that? – Lance Mar 24 '10 at 06:13
  • 1
    hmm. nothing is jumping out at me. Are you sure you don't want to just pipe this through lynx? – Levi Mar 26 '10 at 04:18
  • 3
    Also you could get all the script and link nodes in one shot as follows: doc.css('script, link').each { |node| node.remove } – ema Sep 17 '11 at 15:44
  • Good point ema ~ I always forget how good Nokogiri is at real css selectors :) – Levi Dec 03 '11 at 16:01
9

Is simply stripping tags and excess line breaks acceptable?

html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')

The first gsub strips tags, the second collapses runs of line breaks down to one, and the third removes line breaks at the start and end of the string.
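As a quick illustration of those three steps, here is a sketch that runs them on a made-up input. The method name `strip_tags_and_breaks` and the sample HTML are hypothetical, not part of the original answer.

```ruby
# Applies the three substitutions one after another.
def strip_tags_and_breaks(html)
  no_tags   = html.gsub(/<\/?[^>]*>/, '')   # remove anything that looks like a tag
  collapsed = no_tags.gsub(/\n\n+/, "\n")   # runs of newlines become one newline
  collapsed.gsub(/^\n|\n$/, '')             # drop the leading and trailing newline
end

# Hypothetical sample input.
html = "\n<h1>Title</h1>\n\n\n<p>First <b>line</b></p>\n<p>Second line</p>\n"
puts strip_tags_and_breaks(html)
```

In Ruby, `^` and `$` match at line boundaries rather than only at the ends of the string; the third step still behaves as described here because the second step has already removed every empty line in between.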

Matchu
  • i wish, but I'd like the text to have the same spacing/breaks (the way it appears at least). If there's a double spaced line in HTML, it's converted sometimes into 5-10 line breaks using that. Updated the question. – Lance Mar 24 '10 at 03:25
  • @viatropos: Is it acceptable to simply remove redundant line breaks, then? – Matchu Mar 24 '10 at 03:29
  • then I have to build my own parser in the end :). don't really have the time to do that right now. – Lance Mar 24 '10 at 03:30
  • @viatropos: Why does removing excess line breaks require a parser? See edited answer. – Matchu Mar 24 '10 at 03:31
  • looking for something that's already solved these little issues. thanks for the help though, when I get some time later I'd be down to work through it. until then, if there's something that's ready to go that'd be awesome. – Lance Mar 24 '10 at 03:36
  • @viatropos: It's still not clear to me exactly what output you want. You might want to include an example, since it seems like you're assuming that the output you desire is the format that everyone would desire. Such is likely not the case, and you may, in fact, be forced to do your own work in this case to get exactly what you want. – Matchu Mar 24 '10 at 03:37
  • Parsing HTML using regular expression is a trap. See https://stackoverflow.com/a/1732454 for discussion. – inopinatus Sep 06 '19 at 05:04
4

I'm using the sanitize gem.

(" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")

It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.

yegor256
Bob Aman
4
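require 'open-uri'
require 'nokogiri'

url = 'http://en.wikipedia.org/wiki/Wolfram_language'
# URI.open replaces the bare open(url), which no longer fetches URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open(url))

# Collect the text of every paragraph and top-level heading, in document order.
text = ''
doc.css('p,h1').each do |e|
  text << e.content
end

puts text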

This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include link text, add a to the CSS selector (doc.css('p,h1,a')).

cdrev
3

If you are using Rails, you can:

html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts ActionView::Base.full_sanitizer.sanitize(html)

silva96
2

You want hpricot_scrub:

http://github.com/UnderpantsGnome/hpricot_scrub

You can specify which tags to strip / keep in a config hash.

Matt M.
0

If it's in Rails, you may use this:

html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe
James Tan
0

Building slightly on Matchu's answer, this worked for my (very similar) requirements:

html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish

Hope it makes someone's life a bit easier :-)
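For use outside Rails, where squish may not be available, something like the following plain-Ruby sketch should give a similar single-line result. The method names `plain_squish` and `strip_tags_to_line` are hypothetical, and `plain_squish` is only an approximation of what squish does.

```ruby
# Approximates squish: collapse all whitespace runs to one space, then trim.
def plain_squish(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

# Replace tags with spaces so adjacent words do not run together, then squish.
def strip_tags_to_line(html)
  plain_squish(html.gsub(/<\/?[^>]*>/, ' '))
end

# Hypothetical sample input.
puts strip_tags_to_line("<p>Hello</p>\n\n<p>big   world</p>\n")
```

Since every newline ends up as a space, the separate newline substitutions are not needed in this variant; the output is always a single line.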

avjaarsveld
  • "squish" might be only a Rails method. Using String::strip seems to be an adequate alternative (at least for the conversion I needed). – Lonnie Mar 26 '20 at 11:48