I am trying to scrape data from a webpage and am having trouble manipulating the strings. If you visit the page, you will see that the site is written in French. I am trying to get the data from the table at the bottom of the page. In French, the thousands separator is either a period (.) or a space; this page uses spaces.
Here is my code to scrape the values in the second column:
library(rvest)
link <- read_html("http://perspective.usherbrooke.ca/bilan/servlet/BMTendanceStatPays?langue=fr&codePays=NOR&codeTheme=1&codeStat=SP.POP.TOTL")
link %>%
html_nodes(".tableauBarreDroite") %>%
html_text() -> pop
head(pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"
The values in the pop vector contain the expected spaces, along with an unexpected Â. I tried the following to remove the spaces:
new.pop <- gsub(pattern = " ", replacement = "", x = pop)
head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"
The spaces are still present in the new.pop variable. I also tried removing newline characters instead:
new.pop <- gsub(pattern = "\n", replacement = "", x = pop)
head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"
As you can see, the spaces are not going away. Do you have any idea what I should do to transform the pop vector into a numeric vector after removing the unwanted characters?
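One direction that may be worth trying, sketched under an assumption I have not verified against the page: the separators may be non-breaking spaces (U+00A0) rather than ordinary spaces, which would explain why the pattern " " never matches. The UTF-8 bytes of U+00A0 show up as Â followed by a space-like character when they are displayed as Latin-1. The sample vector below is hypothetical and only stands in for the scraped pop; the idea is to drop everything that is not a digit and then convert.

```r
# Hypothetical stand-in for the scraped vector; "\u00a0" is a non-breaking space
# (an assumption about what the page actually contains).
pop <- c("3\u00a0581\u00a0239", "3\u00a0609\u00a0800")

# Remove every character that is not a digit, whatever kind of space it is,
# then convert the cleaned strings to numeric.
pop_num <- as.numeric(gsub("[^0-9]", "", pop))
pop_num
```

Because the pattern keeps only digits, it should not matter whether the separator is an ordinary space, a non-breaking space, or the Â debris. It would, however, also strip decimal marks and minus signs, so it only suits whole, non-negative counts like these.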