10

I am using rvest to parse a website, and I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp; entity in a parsed HTML document?

library("rvest")
library("stringr")  

minimal <- html("<!doctype html><title>blah</title> <p>&nbsp;foo")

bodytext <- minimal %>%
  html_node("body") %>% 
  html_text

Now I have extracted the body text:

bodytext
[1] " foo"

However, I can't remove that pesky bit of whitespace!

str_trim(bodytext)

gsub(pattern = " ", "", bodytext)
AndrewMacDonald

6 Answers

11

jdharrison answered:

gsub("\\W", "", bodytext)

and that will work, but you can also use:

gsub("[[:space:]]", "", bodytext)

which will remove all space characters: tab, newline, vertical tab, form feed, carriage return, and space, plus possibly other locale-dependent characters. It's a very readable alternative to other, more cryptic regex classes.
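
Whether these classes actually match the non-breaking space can vary with locale and regex engine (which is what the comments below are about); a minimal sketch for checking on your own system:

nbsp <- "\u00a0"                    # the non-breaking space, U+00A0
grepl("\\W", nbsp)                  # non-word characters
grepl("[[:space:]]", nbsp)          # POSIX space class
grepl("\\s", nbsp, perl = TRUE)     # PCRE \s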

hrbrmstr
  • Unfortunately that latter solution, however readable, doesn't work. The problem seems to involve encoding (see my comment to @MrFlick) – AndrewMacDonald Dec 01 '14 at 21:23
  • However, the `\\W` technique DOES work! So apparently, whatever that space is encoded as in my locale, it ISN'T a word! – AndrewMacDonald Dec 01 '14 at 21:24
  • Not accepting this answer because, while that does work, removing non-word characters is too extreme for my application, and I still really want to know how to match this space! – AndrewMacDonald Dec 01 '14 at 22:00
11

I have run into the same problem, and have settled on the simple substitution of

gsub(intToUtf8(160),'',bodytext)

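To unpack that: 160 is the decimal code point of the non-breaking space (U+00A0), so intToUtf8(160) builds exactly that one-character string. A minimal sketch, using a constructed stand-in for bodytext:

nbsp <- intToUtf8(160)            # "\u00a0", the non-breaking space
x <- paste0(nbsp, "foo")          # stand-in for bodytext
charToRaw(x)                      # c2 a0 66 6f 6f in a UTF-8 locale
gsub(nbsp, "", x, fixed = TRUE)   # "foo"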

shabbychef
3

The &nbsp; stands for "non-breaking space" which, in Unicode, is its own character, distinct from a "regular" space (i.e. " "). Compare

charToRaw(" foo")
# [1] 20 66 6f 6f
charToRaw(bodytext)
# [1] c2 a0 66 6f 6f

So you'd want to use one of the special character classes for whitespace. You can remove all whitespace with

gsub("\\s", "", bodytext)

On Windows, I needed to make sure the encoding of the string was set properly

Encoding(bodytext) <- "UTF-8"
gsub("\\s", "", bodytext)
MrFlick
  • that `charToRaw` function is wonderful! OK so I'd actually tried something similar. As per [this answer](http://stackoverflow.com/questions/4515117/php-parsing-problem-nbsp-and-%C3%82), the `&nbsp;` gets interpreted as "Â" and " ". The trouble is that while I could match the "Â" with a regex, I cannot do so with the space. Your encoding trick didn't help. Forgive me for not reproducing this work here; I could not get the "Â" to replicate in my example – AndrewMacDonald Dec 01 '14 at 21:21
  • You'll see the "Â" if you don't have the encoding properly set on the variable. What do you get if you do `Encoding(bodytext)`? You can also probably safely set it to "latin1" – MrFlick Dec 01 '14 at 21:25
  • `Encoding(bodytext)` returns `UTF-8`, yet what appears as a blank space cannot be matched by any expression that targets spaces, neither `\\s` nor `[:space:]` – AndrewMacDonald Dec 01 '14 at 21:43
  • You should amend your question to include the results of `sessionInfo()` which should have R version and OS version. So you're saying you don't see the "Â", you see a space but `\\s` doesn't match it? And you're testing on the example in your original post? And you get the same `charToRaw()` values as I do? – MrFlick Dec 01 '14 at 22:10
3

Posting this since I think it's the most robust approach.

I scraped a Wikipedia page and got this in my output (not sure if it'll copy-paste properly):

x <- " California"

And gsub("\\s", "", x) didn't change anything, which raised the flag that something fishy is going on.

To investigate, I did:

dput(charToRaw(strsplit(x, "")[[1]][1]))
# as.raw(c(0xc2, 0xa0))

To figure out how exactly that character is stored/recognized in memory.

With this in hand, we can use gsub a bit more robustly than in the other solutions:

gsub(rawToChar(as.raw(c(0xc2, 0xa0))), "", x)
# [1] "California"

(@MrFlick's suggestion to set the encoding didn't work for me, and it's not clear where @shabbychef got the input 160 for intToUtf8; this approach can be generalized to other similar situations)
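
As a side note on the 160: the bytes c2 a0 above are just the UTF-8 encoding of U+00A0, which is 160 in decimal. A quick sketch to confirm, assuming the string is valid UTF-8:

utf8ToInt(strsplit(x, "")[[1]][1])                   # 160, i.e. U+00A0 in decimal
rawToChar(as.raw(c(0xc2, 0xa0))) == intToUtf8(160)   # TRUE in a UTF-8 session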

MichaelChirico
0

Using rex may make this type of task a little simpler. Also, I am not able to reproduce your encoding problems; the following correctly substitutes the space regardless of encoding on my machine. (It is the same solution as [[:space:]] though, so it likely has the same issue for you.)

library(rex)

re_substitutes(bodytext, rex(spaces), "", global = TRUE)

#> [1] "foo"
Jim
0

I was able to remove &nbsp; spaces at the beginning and end of strings with mystring %>% stringr::str_trim().
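
A minimal sketch of that, assuming a reasonably current stringr, whose trimming follows the Unicode white-space property (which includes U+00A0):

library(stringr)
str_trim("\u00a0foo\u00a0")   # should return "foo"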

jtr13