I am working with the R programming language.
I am trying to scrape the second table from a Wikipedia page.
Below are the two methods I tried (Method 1 and Method 2):
# METHOD 1
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_municipalities_in_Ontario"
html <- read_html(url)
final <- data.frame(
  html %>%
    html_element("table.wikitable.sortable") %>%
    html_table()
)
> dim(final)
[1] 33 7
In Method 1, the code ran without errors, but the result (33 rows) has far fewer rows than the actual table on the Wikipedia page.
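One possible explanation, sketched here on a small made-up page rather than the real article: in rvest, `html_element()` returns only the first node that matches the selector, while `html_elements()` returns every match. If the page has several tables with the class `wikitable sortable`, Method 1 would pick up only the first one. The HTML string and the names `page`, `first_only`, `all_tables`, and `second` are illustrative assumptions, and which index holds the wanted table on the real page still needs to be checked.

```r
library(rvest)

# Hypothetical two-table page standing in for the Wikipedia article.
page <- minimal_html('
  <table class="wikitable sortable">
    <tr><th>Name</th><th>Population</th></tr>
    <tr><td>A</td><td>10</td></tr>
  </table>
  <table class="wikitable sortable">
    <tr><th>Name</th><th>Population</th></tr>
    <tr><td>B</td><td>20</td></tr>
    <tr><td>C</td><td>30</td></tr>
  </table>
')

# html_element() (singular) keeps only the first matching table.
first_only <- page %>%
  html_element("table.wikitable.sortable") %>%
  html_table()

# html_elements() (plural) keeps every matching table;
# html_table() then returns a list with one table per match.
all_tables <- page %>%
  html_elements("table.wikitable.sortable") %>%
  html_table()

# Pick the second table by position and convert it to a plain data frame.
second <- as.data.frame(all_tables[[2]])
```

Calling `length(all_tables)` and `sapply(all_tables, nrow)` on the real page should show how many tables matched and how large each one is, which helps confirm the right index before selecting it.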
I then tried the following code:
# METHOD 2
library(httr)
library(XML)
r <- GET(url)
final <- readHTMLTable(
  doc = content(r, "text")
)
In Method 2, the table appears to be significantly "bigger" than the previous result (I am still not sure if all the rows of the table were included):
111 9,545 9,631 -0.9% 555.96 17.2/km2
[ reached 'max' / getOption("max.print") -- omitted 307 rows ]
But when I tried to convert the results of Method 2 to a data frame, I got the following error:
final = data.frame(final)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 34, 418, 14, 8, 4
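A likely reading of this error, sketched with made-up data: `readHTMLTable()` returns a list with one data frame per table when the page contains several tables, and `data.frame()` on such a list tries to combine all of them as columns of a single data frame. That fails when the tables have different numbers of rows, which would match the five row counts in the message. The list `tables` below is a hypothetical stand-in for the object returned in Method 2, and choosing element 2 is an assumption about where the wanted table sits.

```r
# Hypothetical stand-in for the list that readHTMLTable() returns:
# one data frame per table, each with a different number of rows.
tables <- list(
  first  = data.frame(id = 1:3),
  second = data.frame(
    name = c("A", "B", "C", "D", "E"),
    population = c(10, 20, 30, 40, 50)
  )
)

# data.frame() on the whole list tries to bind the tables side by side,
# which fails because their row counts differ. Capture the message
# instead of stopping the script.
err <- tryCatch(
  data.frame(tables),
  error = function(e) conditionMessage(e)
)

# Inspect the number of rows in each table to find the one that is wanted.
row_counts <- sapply(tables, nrow)

# Select a single table from the list by position instead of converting the list.
final <- tables[[2]]
```

With the real result from Method 2, `sapply(final, nrow)` should list the same row counts as the error message, and indexing the list with `[[ ]]` yields one table as a data frame.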
Can someone please show me what I am doing wrong and how I can fix this?
Thanks!