I am trying to scrape the results of the Polish elections that were held this weekend, but I have run into a problem: a seemingly random float is prepended to every integer.
I have tried using htmltab, but it did not work. As you can see, a random number is added in front of each value:
library(htmltab)
url <- "https://wybory2018.pkw.gov.pl/pl/geografia/020000#results_vote_council"
tmp <- htmltab::htmltab(doc = url, which = 1)
tmp
Wyszczególnienie Liczba
2 Mieszkańców 0.972440432 755 957
3 Wyborców 0.977263472 273 653
4 Obwodów 0.99998061 940
I checked the HTML to see what the problem is:
library(xml2)
library(rvest)
webpage <- xml2::read_html(url)
a <- webpage %>%
  rvest::html_nodes("tbody")
a[1]
<tbody>\n<tr>\n<td>Mieszkańców</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.97244043</span>2 755 957</td>\n </tr>\n<tr>\n<td>Wyborców</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.97726347</span>2 273 653</td>\n </tr>\n<tr>\n<td>Obwodów</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.9999806</span>1 940</td>\n </tr>\n</tbody>"
I assume the problem is the <span class="hidden"> element inside each cell, but how do I get rid of it?
EDIT
I need the data from the 9th table, which contains the results for each party:
Nr listy | Komitet wyborczy | Liczba głosów na kandydatów komitetu | Liczba kandydatów | % głosów ważnych
12 | KOMITET WYBORCZY WYBORCÓW Z DUTKIEWICZEM DLA DOLNEGO ŚLĄSKA | 93 260 | 45 | 8.29%
9 | KOMITET WYBORCZY WYBORCÓW WOLNOŚĆ W SAMORZĄDZIE | 15 499 | 46 | 1.38%
8 | KOMITET WYBORCZY WYBORCÓW KUKIZ'15 | 53 800 | 41 | 4.78%
1 | KOMITET WYBORCZY WYBORCÓW BEZPARTYJNI SAMORZĄDOWCY | 168 442 | 46 | 14.98%
11 | KOMITET WYBORCZY WOLNI I SOLIDARNI | 9 624 | 38 | 0.86%
7 | KOMITET WYBORCZY RUCH NARODOWY RP | 14 874 | 38 | 1.32%
10 | KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ | 320 908 | 45 | 28.53%
2 | KOMITET WYBORCZY POLSKIE STRONNICTWO LUDOWE | 58 820 | 46 | 5.23%
6 | KOMITET WYBORCZY PARTII RAZEM | 18 087 | 44 | 1.61%
3 | KOMITET WYBORCZY PARTIA ZIELONI | 19 783 | 36 | 1.76%
5 | KOALICYJNY KOMITET WYBORCZY SLD LEWICA RAZEM | 61 889 | 46 | 5.50%
4 | KOALICYJNY KOMITET WYBORCZY PLATFORMA.NOWOCZESNA KOALICJA OBYWATELSKA | 289 831 | 46 | 25.77%
EDIT 2
I have found a solution, although it is not the most elegant one:
# Helper taken from https://stackoverflow.com/questions/7963898/extracting-the-last-n-characters-from-a-string-in-r
library(dplyr)  # provides mutate(), select() and %>%

# Return the last n characters of x
substrRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}

# 9th table as parsed by htmltab; the numbers still carry the hidden prefix
tmp <- htmltab::htmltab(doc = url, which = 9)

# Text of the hidden spans in the same table, reshaped to 4 columns per row
tmp2 <- xml2::read_html(url) %>%
  rvest::html_nodes("tbody") %>%
  magrittr::extract2(9) %>%
  rvest::html_nodes("tr") %>%
  rvest::html_nodes("td") %>%
  rvest::html_nodes("span") %>%
  rvest::html_text() %>%
  matrix(ncol = 4, byrow = TRUE) %>%
  data.frame()

names(tmp) <- c("a", "b", "c", "d", "e", "f", "g")

# Cut the hidden prefix off column c by its length, then convert to numeric
tmp3 <- cbind(tmp, tmp2) %>%
  mutate(n_to_delete = nchar(X1),
         c1 = as.character(c),
         n_whole = nchar(c1),
         c2 = substrRight(c1, n_whole - n_to_delete),
         c3 = gsub(" ", "", c2),
         c4 = as.numeric(c3)) %>%
  select(b, c4)

names(tmp3) <- c("party", "n_of_votes")
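A possibly simpler alternative to trimming the prefix by its length would be to delete the span.hidden nodes from the parsed document before reading the cell text. The sketch below runs on a small HTML snippet copied from the tbody output above (the first table) rather than on the live page. The snippet, the variable names and the digits-only cleanup are my own assumptions for illustration, and the real page may differ in whitespace or markup.

```r
library(xml2)
library(rvest)

# Assumption: a trimmed copy of the first table's markup shown earlier
snippet <- '<table><tbody>
<tr><td>Mieszkańców</td><td class="table-number"><span class="hidden">0.97244043</span>2 755 957</td></tr>
<tr><td>Wyborców</td><td class="table-number"><span class="hidden">0.97726347</span>2 273 653</td></tr>
<tr><td>Obwodów</td><td class="table-number"><span class="hidden">0.9999806</span>1 940</td></tr>
</tbody></table>'

page <- xml2::read_html(snippet)

# Find the hidden spans and remove them from the document in place
hidden <- rvest::html_nodes(page, "span.hidden")
xml2::xml_remove(hidden)

# Read the remaining cell text and keep only the digits
cells <- rvest::html_text(rvest::html_nodes(page, "td.table-number"))
counts <- as.numeric(gsub("[^0-9]", "", cells))
counts
```

If this works on the real page, the same removal step could presumably be applied to the document read from url before extracting the 9th tbody, which would make the substrRight helper unnecessary. Note that stripping every non-digit is only safe for integer columns, not for the percentage column.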