I have five datasets, which cover the same topics over time.
library(data.table)
DT <- data.table(A= round(rnorm(10,10,10),2),
B= round(rnorm(10,10,10),2),
C= round(rnorm(10,10,10),2))
DT_2 <- data.table(A= round(rnorm(10,10,10),2),
B= round(rnorm(10,10,10),2),
C= round(rnorm(10,10,10),2),
D= round(rnorm(10,10,10),2))
DT_3 <- DT
DT_4 <- DT_2
DT_5 <- DT_2
names(DT) <- c("something","nothing", "anything")
names(DT_2) <- c("some thing","no thing", "any thing", "number4")
names(DT_3) <- c("some thing wrong","nothing", "anything_")
names(DT_4) <- c("something","nothingg", "anything", "number_4")
names(DT_5) <- c("something","nothing", "anything happening", "number4")
However, each year, they are a little different. The names of the columns have slightly changed, some columns have been added and some have been removed. I would like to "rowbind" these datasets. However, every dataset has about 100 columns, and manually making all the column names consistent would be hell.
EDIT: Please note that the columns do not necessarily have the same index, such as is the case for example in the edited column names below, where DT_2
has the column XXX
.
# EDIT
names(DT)<- c("something","nothing", "anything", "number4")
names(DT_2)<- c("some thing","no thing","XXX", "number4")
names(DT_3)<- c("some thing wrong","nothing", "anything_")
names(DT_4)<- c("something","nothingg", "anything", "number_4")
names(DT_5)<- c("something","nothing", "anything happening", "number4")
I thought it might be a better idea to write a function to do it for me.
I once asked some help with a function that did something similar here. The following function coalesces columns with capitalised and non capitalised versions of variables names without specifying the variable names.
Very neatly, it additionally specifies which var names were coalesced.
library(data.table)
library(magrittr) # piping is used to improve readability
names(DT_panel) %>%
data.table(orig = ., lc = tolower(.)) %>%
.[, {
if (.N > 1L) {
new <- toupper(.BY)
old <- setdiff(orig, new)
DT_panel[, (new) := fcoalesce(.SD), .SDcols = orig]
DT_panel[, (old) := NULL]
sprintf("Coalesced %s onto %s", toString(old), new)
}
}, by = lc]
In addition, I found this question here, which does a fuzzy join based on column entries.
library(fuzzyjoin); library(dplyr);
stringdist_join(a, b,
by = "name",
mode = "left",
ignore_case = FALSE,
method = "jw",
max_dist = 99,
distance_col = "dist") %>%
group_by(name.x) %>%
top_n(1, -dist)
The problem is that I don't under stand either of these solution well enough to combine them in to a function providing the solution that I want.
Could anyone help me make a start? My desired output is as follows:
DT <- data.table(A= round(rnorm(10,10,10),2),
B= round(rnorm(10,10,10),2),
C= round(rnorm(10,10,10),2))
DT_2 <- data.table(A= round(rnorm(10,10,10),2),
B= round(rnorm(10,10,10),2),
C= round(rnorm(10,10,10),2),
D= round(rnorm(10,10,10),2))
D <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
DT_3 <- DT
DT_4 <- DT_2
DT_5 <- DT_2
DT <- cbind(DT, D)
DT_3 <- cbind(DT_3, D)
DT <- rbind (DT, DT_2, DT_3, DT_4, DT_5)
names(DT) <- c("something","nothing", "anything", "number4")