0

I have 2 data sets

  1. Birth-related outcomes (150k records)
  2. Causes of death (320k records)

Both these data sets have location and year in common. I'm unsure of how to merge these data sets into one. The goal is to merge these data sets and provide a visualization by comparing the birth-related outcome and causes of death based on location.

I'm trying to do this via R. But since the size is different, I couldn't proceed further.

How to merge the data with R or Excel or any suitable data cleaning tool?

James Z
  • 12,209
  • 10
  • 24
  • 44
Haree
  • 1
  • Did you try `merge(first_data_set, second_data_set)`? If the columns to merge on have the same names, that should work just fine. If the column names aren't the same, you'll have to specify which ones to match up. See `?merge` for details. – Gregor Thomas Nov 20 '17 at 15:43

1 Answers1

0

Actually, the need being merging both data sets to have columns form each, the size difference is not a problem as we are not juxtaposing columns, we are merging them based on the values of the year and location fields from each data sets. So the merging will have a multivariable merging key (or we can simply concatenate them into one variable to have a simpler key). In all cases we can have 3 different cases:

  • key=(year, location) in both D1 and D2
  • key=(year, location) in D1 but not in D2
  • key=(year, location) in D2 but not in D1

I presume that the visualisation needs the columns from both D1 and D2 (the first bullet point): an inner join of the data sets is then needed. to perform it in R, we can use the data.table package which have a very simple syntax.

library(data.table)
# Extract the data in a data.table format 
    # can be replaced by reading from a file using fread or any other command to have both data sets as data tables.
dt1 <- as.data.table(df1) 
dt2 <- as.data.table(df2)

# set the merging key in each data table
setkey(dt1, "year", "location")
setkey(dt2, "year", "location")
# merge data tables
dt.result <- merge(dt1, dt2) # keep the default settings (inner join)