10

I am using rvest to parse a website, and I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp; entity in a parsed HTML document?

library("rvest")
library("stringr")  

minimal <- html("<!doctype html><title>blah</title> <p>&nbsp;foo")

bodytext <- minimal %>%
  html_node("body") %>% 
  html_text

Now I have extracted the body text:

bodytext
[1] " foo"

However, I can't remove that pesky bit of whitespace!

str_trim(bodytext)

gsub(pattern = " ", "", bodytext)
AndrewMacDonald

6 Answers

11

jdharrison answered:

gsub("\\W", "", bodytext)

and that will work, but you can also use:

gsub("[[:space:]]", "", bodytext)

which will remove all space characters: tab, newline, vertical tab, form feed, carriage return, and space, plus possibly other locale-dependent characters. It's a very readable alternative to other, more cryptic regex classes.
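
Whether these classes actually match the non-breaking space can vary with locale and regex engine (which is what the comments below are about); a minimal sketch for checking on your own system:

nbsp <- "\u00a0"                    # the non-breaking space, U+00A0
grepl("\\W", nbsp)                  # non-word characters
grepl("[[:space:]]", nbsp)          # POSIX space class
grepl("\\s", nbsp, perl = TRUE)     # PCRE \s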

hrbrmstr
  • Unfortunately that latter solution, however readable, doesn't work. The problem seems to involve encoding (see my comment to @MrFlick) – AndrewMacDonald Dec 01 '14 at 21:23
  • However, the `\\W` technique DOES work! So apparently, whatever that space is encoded as in my locale, it ISN'T a word! – AndrewMacDonald Dec 01 '14 at 21:24
  • Not accepting this answer because, while that does work, removing non-word characters is too extreme for my application, and I still really want to know how to match this space! – AndrewMacDonald Dec 01 '14 at 22:00
11

I have run into the same problem, and have settled on the simple substitution of

gsub(intToUtf8(160),'',bodytext)

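To unpack that: 160 is the decimal code point of the non-breaking space (U+00A0), so intToUtf8(160) builds exactly that one-character string. A minimal sketch, using a constructed stand-in for bodytext:

nbsp <- intToUtf8(160)            # "\u00a0", the non-breaking space
x <- paste0(nbsp, "foo")          # stand-in for bodytext
charToRaw(x)                      # c2 a0 66 6f 6f in a UTF-8 locale
gsub(nbsp, "", x, fixed = TRUE)   # "foo"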

shabbychef
3

The &nbsp; stands for "non-breaking space" which, in Unicode, is its own character, distinct from a "regular" space (i.e. " "). Compare

charToRaw(" foo")
# [1] 20 66 6f 6f
charToRaw(bodytext)
# [1] c2 a0 66 6f 6f

So you'd want to use one of the special character classes for whitespace. You can remove all whitespace with

gsub("\\s", "", bodytext)

On Windows, I needed to make sure the encoding of the string was set properly

Encoding(bodytext) <- "UTF-8"
gsub("\\s", "", bodytext)
MrFlick
  • that `charToRaw` function is wonderful! OK so I'd actually tried something similar. As per [this answer](http://stackoverflow.com/questions/4515117/php-parsing-problem-nbsp-and-%C3%82), the `&nbsp;` gets interpreted as "Â" and " ". The trouble is that while I could match the "Â" with a regex, I cannot do so with the space. Your encoding trick didn't help. Forgive me for not reproducing this work here; I could not get the "Â" to replicate in my example – AndrewMacDonald Dec 01 '14 at 21:21
  • You'll see the "Â" if you don't have the encoding properly set on the variable. What do you get if you do `Encoding(bodytext)`? You can also probably safely set it to "latin1" – MrFlick Dec 01 '14 at 21:25
  • `Encoding(bodytext)` returns `UTF-8`, yet what appears as a blank space cannot be matched by any expression that targets spaces, neither `\\s` nor `[:space:]` – AndrewMacDonald Dec 01 '14 at 21:43
  • You should amend your question to include the results of `sessionInfo()` which should have R version and OS version. So you're saying you don't see the "Â", you see a space but `\\s` doesn't match it? And you're testing on the example in your original post? And you get the same `charToRaw()` values as I do? – MrFlick Dec 01 '14 at 22:10
3

Posting this since I think it's the most robust approach.

I scraped a Wikipedia page and got this in my output (not sure if it'll copy-paste properly):

x <- " California"

And gsub("\\s", "", x) didn't change anything, which raised the flag that something fishy is going on.

To investigate, I did:

dput(charToRaw(strsplit(x, "")[[1]][1]))
# as.raw(c(0xc2, 0xa0))

To figure out how exactly that character is stored/recognized in memory.

With this in hand, we can use gsub a bit more robustly than in the other solutions:

gsub(rawToChar(as.raw(c(0xc2, 0xa0))), "", x)
# [1] "California"

(@MrFlick's suggestion to set the encoding didn't work for me, and it's not clear where @shabbychef got the input 160 for intToUtf8; this approach can be generalized to other similar situations)
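
As a side note on the 160: the bytes c2 a0 above are just the UTF-8 encoding of U+00A0, which is 160 in decimal. A quick sketch to confirm, assuming the string is valid UTF-8:

utf8ToInt(strsplit(x, "")[[1]][1])                   # 160, i.e. U+00A0 in decimal
rawToChar(as.raw(c(0xc2, 0xa0))) == intToUtf8(160)   # TRUE in a UTF-8 session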

MichaelChirico
0

Using rex may make this type of task a little simpler. Also, I am not able to reproduce your encoding problems; the following correctly substitutes the space regardless of encoding on my machine. (It is the same solution as [[:space:]] though, so it likely has the same issue for you.)

library(rex)

re_substitutes(bodytext, rex(spaces), "", global = TRUE)

#> [1] "foo"
Jim
0

I was able to remove &nbsp; spaces at the beginning and end of strings with mystring %>% stringr::str_trim().
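
A minimal sketch of that, assuming a reasonably current stringr, whose trimming follows the Unicode white-space property (which includes U+00A0):

library(stringr)
str_trim("\u00a0foo\u00a0")   # should return "foo"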

jtr13