7

I can't remove whitespace from a string.

My HTML is:

<p class='your-price'>
Cena pro Vás: <strong>139&nbsp;<small>Kč</small></strong>
</p>

My code is:

#encoding: utf-8
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
site  = agent.get("http://www.astratex.cz/podlozky-pod-raminka/doplnky")
price = site.search("//p[@class='your-price']/strong/text()")

val = price.first.text  # => "139 "
val.strip               # => "139 "
val.gsub(" ", "")       # => "139 "

gsub, strip, etc. don't work. Why, and how do I fix this?

val.class      # => String
val.dump       # => "\"139\\u{a0}\""   (note the \u{a0})
val.encoding   # => #<Encoding:UTF-8>

__ENCODING__               # => #<Encoding:UTF-8>
Encoding.default_external  # => #<Encoding:UTF-8>

I'm using Ruby 1.9.3, so Unicode shouldn't be a problem.

A.D.

2 Answers

23

strip only removes ASCII whitespace, and the character you've got here is a Unicode non-breaking space (U+00A0).

Removing the character is easy. You can use gsub by providing a regex with the character code:

gsub(/\u00a0/, '')

You could also call

gsub(/[[:space:]]/, '')

to remove all Unicode whitespace. For details, check the Regexp documentation.
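As a minimal sketch of both approaches, assuming the value has the same shape as the one in the question (the literal `"139\u00a0"` and the variable names are illustrative stand-ins for the scraped text):

```ruby
# encoding: utf-8

# Stand-in for the scraped value: digits followed by a non-breaking space (U+00A0).
val = "139\u00a0"

# strip leaves the NBSP in place because it only treats ASCII whitespace
# (and the null character) as strippable.
stripped = val.strip

# Target the NBSP explicitly by its code point...
nbsp_only = val.gsub(/\u00a0/, '')

# ...or remove every character in the POSIX space class, which covers
# Unicode whitespace for UTF-8 strings.
all_spaces = val.gsub(/[[:space:]]/, '')

puts stripped.dump
puts nbsp_only.dump
puts all_spaces.dump
```

Once the non-breaking space is gone, `all_spaces.to_i` gives the price as an integer.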

the Tin Man
toniedzwiedz
  • 2
    You could also use `\p{Space}` as an alternative to `[[:space:]]` if you prefer (I think they’re the same). – matt Jan 02 '13 at 19:29
  • 1
    An alternative is to use `gsub('&nbsp;', '')` or `gsub('&nbsp;', ' ')` before parsing and get them all in one pass. – the Tin Man Jan 02 '13 at 20:12
  • @theTinMan using `gsub` on an HTML document seems like a good idea provided that there are many elements to extract. An unnecessary effort, when there's only one... And a spectacularly bad one if you want to parse a page like this one to grab the content of your very comment :) Whoopsie Daisy, wasn't that one wrapped with ``? – toniedzwiedz Jan 02 '13 at 20:24
  • I'll try that @home, but I'm sure I tried gsub with `/\s+/` even `/\s+/u`. And why `strip` (and probably others) works only with ASCII? Programmers like me assume that Ruby will care about this automatically ;) – A.D. Jan 02 '13 at 22:01
  • 1
    @A.D. `/\s/` is ASCII-only as well – toniedzwiedz Jan 02 '13 at 22:37
  • 4
    "`Programmers like me assume that Ruby will care about this automatically`" Don't assume, educate yourself on what your language does. If the language did everything, it would be worthless for those times we need it to do something different or new. As programmers we engineer solutions from smaller pieces of code that are designed to be general purpose tools. We plug them in, use them to shape data into whatever we need, and we don't blindly "assume" things will work magically. ASCII vs. UTF-8/Unicode will be a battle for years to come, as long as the internet is full of HTML. – the Tin Man Jan 02 '13 at 22:43
  • 1
    I agree that a programmer can't assume everything, but [Class: String](http://www.ruby-doc.org/core-1.9.3/String.html) says: "`A String object holds and manipulates an arbitrary sequence of bytes, typically representing characters.`" And when the documentation says "`Removes leading and trailing whitespace from str.`" I assume it removes all whitespace. I have to dig deeper, maybe it's for another question. – A.D. Jan 03 '13 at 11:43
  • Leading and trailing whitespace is much different than "all whitespace" as a string can contain spaces between characters to form words. – the Tin Man Feb 14 '20 at 01:43
  • C'mon. ;) `I assume it removes all [leading/trailing] whitespace.` Even back then I understood what is whitespace. – A.D. Feb 14 '20 at 14:02
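The claims in the comments about `\s`, `[[:space:]]`, and `\p{Space}` can be checked against a single non-breaking space. This is only a quick sketch; the variable names are made up for illustration:

```ruby
# encoding: utf-8

nbsp = "\u00a0"

# =~ returns the match index or nil, so !...nil? turns it into a boolean.
matches_backslash_s = !(nbsp =~ /\s/).nil?            # \s is ASCII-only
matches_posix_space = !(nbsp =~ /[[:space:]]/).nil?   # POSIX bracket class
matches_prop_space  = !(nbsp =~ /\p{Space}/).nil?     # Unicode property

puts "\\s:          #{matches_backslash_s}"
puts "[[:space:]]: #{matches_posix_space}"
puts "\\p{Space}:   #{matches_prop_space}"
```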
0

If I wanted to remove non-breaking spaces "\u00A0" AKA &nbsp; I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML("&nbsp;")

s = doc.text # => " "

# s is the NBSP
s.ord.to_s(16)                   # => "a0"

# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"

So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is an ordinary space, and methods such as strip will handle it.

tr is extremely fast and easy to use.
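Applied to a value shaped like the one in the question, that might look like the sketch below. The literal `"139\u00a0"` is an assumed stand-in for the scraped text, and `String#delete` is shown only as a related option for dropping the character outright:

```ruby
# encoding: utf-8

val = "139\u00a0"

# Translate the NBSP into an ordinary space, after which strip can remove it.
normalized = val.tr("\u00A0", ' ')
clean      = normalized.strip

# Alternatively, delete the NBSP directly without translating it first.
removed = val.delete("\u00A0")

puts normalized.dump
puts clean.dump
puts removed.dump
```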

An alternative is to pre-process the actual encoded character "&nbsp;" before it's been extracted from the HTML. This is simplified, but it would work for an entire HTML file just as well as for a single entity in the string:

s = "&nbsp;"

s.gsub('&nbsp;', ' ') # => " "

Using a fixed string for the target is faster than using a regular expression:

s = "&nbsp;" * 10000

require 'fruity'

compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
end

# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1

Regular expressions are useful if you need their capability, but they can drastically slow code.
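If the fruity gem isn't available, a rough version of the same comparison can be sketched with the standard library's Benchmark module. This is a single-run timing, so treat the numbers as indicative only; they vary by machine and Ruby version:

```ruby
require 'benchmark'

s = "&nbsp;" * 10_000

fixed_result = nil
regex_result = nil

# Time one pass of each variant; both should produce the same string.
fixed_time = Benchmark.realtime { fixed_result = s.gsub('&nbsp;', ' ') }
regex_time = Benchmark.realtime { regex_result = s.gsub(/&nbsp;/, ' ') }

puts "same output: #{fixed_result == regex_result}"
puts format("fixed: %.6fs, regex: %.6fs", fixed_time, regex_time)
```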

the Tin Man