HTML to Plain Text with Ruby?

Question

Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.

If I write something in Google Docs, like this, and run that command, it outputs (removing the CSS and JavaScript) this:

\n\n\n\n\nh1. Test&nbsp;h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!

So the formatting's all messed up. I'm sure someone has solved the details like these somewhere out there.

If you are using Rails, you can use strip_tags; before that, just run a gsub replacing <br> with \n. – denysonique, Jun 18 '11 at 06:05

score 67 · Accepted Answer · answered Mar 24 '10 at 03:35

Actually, this is much simpler:

require 'rubygems'
require 'nokogiri'

puts Nokogiri::HTML(my_html).text

You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.

Matchu

True, but that way you do not get rid of the – ema Sep 17 '11 at 15:46

score 16 · Answer 2 · edited Feb 09 '13 at 13:31

You could start with something like this:

require 'open-uri'
require 'rubygems'
require 'nokogiri'

uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
doc = Nokogiri::HTML(URI.open(uri)) # URI.open: Kernel#open stopped accepting URLs in Ruby 3.0
doc.css('script, link').each { |node| node.remove }
puts doc.css('body').text.squeeze(" \n")

edited Feb 09 '13 at 13:31

artm

answered Mar 24 '10 at 03:36

Levi

If you care to clean up a bit more, replace the last line with this: puts doc.css('body').text.split("\n").collect { |line| line.strip }.join("\n") – Levi Mar 24 '10 at 03:50
This is really close! I ran this on http://docs.google.com/View?id=dffk85xk_63f29hv2hn and it's almost perfect. It's just not including the first new line ( or tags around content). Is there a way to include that? – Lance Mar 24 '10 at 06:13
hmm. nothing is jumping out at me. Are you sure you don't want to just pipe this through lynx? – Levi Mar 26 '10 at 04:18
Also you could get all the script and link nodes in one shot as follows: doc.css('script, link').each { |node| node.remove } – ema Sep 17 '11 at 15:44
Good point ema ~ I always forget how good Nokogiri is at real css selectors :) – Levi Dec 03 '11 at 16:01
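To see how the two clean-up steps from this answer and its comments differ, here is a small sketch on a made-up string (the sample text and variable names are only for illustration). squeeze(" \n") collapses runs of spaces and runs of newlines, while the per-line strip removes leading and trailing whitespace on each line but keeps blank lines.

```ruby
raw = "  Title  \n\n\n   first line   \n\t second line\n"

# Collapse repeated spaces and repeated newlines; line edges keep one space.
squeezed = raw.squeeze(" \n")

# Strip each line, as in the comment above (map is the same as collect).
stripped = raw.split("\n").map(&:strip).join("\n")

# Combining both ideas: strip each line and drop the empty ones.
compact = raw.split("\n").map(&:strip).reject(&:empty?).join("\n")

p squeezed # => " Title \n first line \n\t second line\n"
p stripped # => "Title\n\n\nfirst line\nsecond line"
p compact  # => "Title\nfirst line\nsecond line"
```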

score 9 · Answer 3 · edited Mar 24 '10 at 03:29

Is simply stripping tags and excess line breaks acceptable?

html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')

First strips tags, second takes duplicate line breaks down to one, third removes line breaks at the start and end of the string.
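If the goal is to keep the visible line breaks, a variation on the same idea is to turn <br> and closing block tags into newlines before stripping everything else. This is a rough sketch using only the standard library; strip_html and the list of block tags are assumptions for illustration, and regex-based stripping will still trip over malformed markup, comments, and script content.

```ruby
require 'cgi'

# Hypothetical helper, not part of the answer above.
def strip_html(html)
  text = html.gsub(/<br\s*\/?>/i, "\n")                  # <br> becomes a line break
  text = text.gsub(/<\/(p|div|h[1-6]|li|tr)\s*>/i, "\n") # closing block tags end a line
  text = text.gsub(/<\/?[^>]*>/, '')                     # drop the remaining tags
  # CGI.unescapeHTML only knows a handful of named entities, so &nbsp; is handled by hand.
  text = CGI.unescapeHTML(text.gsub('&nbsp;', ' '))
  # Trim trailing blanks on each line and cap runs of blank lines.
  text.gsub(/[ \t]+$/, '').gsub(/\n{3,}/, "\n\n").strip
end

sample = '<h1>Test&nbsp;</h1><h2>HELLO THERE</h2><p>I am some text<br/>on the next line!!!</p>'
puts strip_html(sample)
```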

edited Mar 24 '10 at 03:29

answered Mar 24 '10 at 03:16

Matchu

i wish, but I'd like the text to have the same spacing/breaks (the way it appears at least). If there's a double spaced line in HTML, it's converted sometimes into 5-10 line breaks using that. Updated the question. – Lance Mar 24 '10 at 03:25
@viatropos: Is it acceptable to simply remove redundant line breaks, then? – Matchu Mar 24 '10 at 03:29
then I have to build my own parser in the end :). don't really have the time to do that right now. – Lance Mar 24 '10 at 03:30
@viatropos: Why does removing excess line breaks require a parser? See edited answer. – Matchu Mar 24 '10 at 03:31
looking for something that's already solved these little issues. thanks for the help though, when I get some time later I'd be down to work through it. until then, if there's something that's ready to go that'd be awesome. – Lance Mar 24 '10 at 03:36
@viatropos: It's still not clear to me exactly what output you want. You might want to include an example, since it seems like you're assuming that the output you desire is the format that everyone would desire. Such is likely not the case, and you may, in fact, be forced to do your own work in this case to get exactly what you want. – Matchu Mar 24 '10 at 03:37
Parsing HTML using regular expression is a trap. See https://stackoverflow.com/a/1732454 for discussion. – inopinatus Sep 06 '19 at 05:04

score 4 · Answer 4 · edited May 07 '19 at 04:06

I'm using the sanitize gem.

(" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")

It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.

edited May 07 '19 at 04:06

yegor256

answered Oct 24 '13 at 11:14

Bob Aman

score 4 · Answer 5 · answered Feb 26 '14 at 13:55

require 'open-uri'
require 'nokogiri'

url = 'http://en.wikipedia.org/wiki/Wolfram_language'
doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open stopped accepting URLs in Ruby 3.0

text = ''
doc.css('p,h1').each do |e|
  text << e.content
end

puts text

This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include link text, add a to the CSS selector passed to doc.css (for example 'p,h1,a').

score 3 · Answer 6 · answered Jun 02 '17 at 21:42

If you are using Rails you can:

html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts ActionView::Base.full_sanitizer.sanitize(html)

silva96

score 2 · Answer 7 · answered Mar 24 '10 at 03:53

You want hpricot_scrub:

http://github.com/UnderpantsGnome/hpricot_scrub

You can specify which tags to strip / keep in a config hash.

Matt M.

score 0 · Answer 8 · answered Aug 14 '15 at 04:18

If it's in Rails, you may use this:

html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe

James Tan

score 0 · Answer 9 · answered Jan 08 '16 at 16:32

Building slightly on Matchu's answer, this worked for my (very similar) requirements:

html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish

Hope it makes someone's life a bit easier :-)

avjaarsveld

"squish" might be only a Rails method. Using String::strip seems to be an adequate alternative (at least for the conversion I needed). – Lonnie Mar 26 '20 at 11:48
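For what it's worth, squish comes from ActiveSupport, and strip alone is not quite the same thing: strip only trims the ends, while squish also collapses inner whitespace (including newlines) to single spaces. Outside Rails, something like the following approximates it; the free-standing squish method here is a stand-in written for illustration, not the ActiveSupport implementation.

```ruby
# Hypothetical plain-Ruby stand-in for ActiveSupport's String#squish.
def squish(str)
  str.gsub(/[[:space:]]+/, ' ').strip
end

messy = "  Hello \n\n  there,\t world  "

p messy.strip   # => "Hello \n\n  there,\t world"
p squish(messy) # => "Hello there, world"
```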
