1

I am creating a graph that correlates the life expectancy age and the state pension age for each country. I have used web scraping packages to scrape 2 datasets from 2 Wikipedia pages.

One of the datasets contains the column "Country" and the other dataset contains the column "Country and regions". This is a problem because both datasets need to merge but are unbalanced due to the regions in the "Country and regions" column.

To solve this, I need to remove the regions in "Country and regions", before merging the datasets, so it is balanced. I need to find the unmatched data from "Country and regions" with "Country", remove it, and create one data frame with the 2 datasets.

library(xml2)
library(rvest)
library(stringr)

urlLifeExpectancy <- "https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy"

extractedLifeData = urlLifeExpectancy %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table(fill = TRUE)

urlPensionAge <- "https://en.wikipedia.org/wiki/Retirement_age#Retirement_age_by_country"

extractedPensionData = urlPensionAge %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div/table[3]') %>%
  html_table(fill = TRUE)
Nick
  • 55
  • 1
  • 7
  • does this help? `merge(extractedLifeData, extractedPensionData, by.y = "Country", by.x = "Country and regions", all = TRUE)` – Ronak Shah May 26 '19 at 11:46
  • This merges the two datasets into one and doesn't remove the regions, it just combines "Country" with Country and regions". Because of this, there will be no pension data for the regions. I am only looking to keep the countries. – Nick May 26 '19 at 11:58
  • I am not clear what your expected output would look like? How many and which columns will it have? `extractedLifeData[extractedLifeData$\`Country and regions\` %in% extractedPensionData$Country, ]` removes the rows which are not present in `Country` column. – Ronak Shah May 26 '19 at 12:03
  • The outcome is meant to display the following columns. "Country", "Men", "Women" (from extractedPensionData), and "Male life expectancy", "Female life expectancy" (from extractedLifeData). – Nick May 26 '19 at 12:09
  • `merge(extractedLifeData[c(1, 5, 7)], extractedPensionData[1:3], by.y = "Country", by.x = "Country and regions")` . this? – Ronak Shah May 26 '19 at 12:16
  • Perfect, if you put it in the answer I'll mark as solved – Nick May 26 '19 at 12:20

2 Answers2

1

We can use join from data.table

library(data.table)
setDT(extractedLifeData[c(1, 5, 7)][extractedPensionDate[1:3],
       on = .(Country = `Country and regions`)]
akrun
  • 874,273
  • 37
  • 540
  • 662
0

We can use merge by selecting the columns which we need from both the datasets

merge(extractedLifeData[c(1, 5, 7)], extractedPensionData[1:3], 
       by.y = "Country", by.x = "Country and regions")

Or use inner_join from dplyr

library(dplyr)

extractedLifeData %>% select(1, 5, 7) %>%
     inner_join(extractedPensionData %>% select(1:3), 
                by = c("Country and regions" = "Country"))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213