Apologies in advance if this has already been asked/answered elsewhere, but I've looked at quite a number of threads and nothing's worked so far.
I'm working on combining a large number of data files into one database. Each data file has the same first four (sometimes five) column names which are words/characters, but after that the columns names are labelled with numbers, and are different (they'll be the same for a set of files, respectively, but different for a different set of files). So just to give an example, say file 1 has columns a, b, c, d, 1.1, 1.2, 2.3, file 2 has columns a, b, c, d, 1.3, 1.4, 2.1, and file 3 has columns a, b, c, e, 3.2, 5.1.
Each file then contains a varying number of observations (the first column is always the report date). Some observations are numerical and others are characters. I want to read all the files in at once and combine them into a data frame such that (1) the columns which are shared with other files are combined into one, (2) columns which are different are automatically added, and (3) for those observation where the values observed in the first four/five columns, ie report date and similar specifications, are all identical, observations are entered on the same row. For example, if file 1 and 2 are identical on columns a, b, c, and d, but file 1 has observations in columns 1.1, 1.2, 2.3, with the rest missing and file 2 has observations in columns a, b, c, d, 1.3, 1.4, 2.1 and no others, I want those observations to just be added on the same row. (So far the best I've been able to do is to have a separate row for each row from the original files, which leads to my end result consisting mostly of NAs/empty cells, and not being very compact or usable.)
I have a large number of files, each of which is a different length, and I want to read them in and merge them all at once using a loop. What I've managed so far is as follows:
# packages
library(data.table)
library(plyr)
library(reshape)
library(dplyr)
#make a list of files in a folder and label it "filenames"
filenames <- list.files("path", full.names = T)
#read each element in "filenames" into R and label the resulting list "csvs"
csvs <- lapply(filenames, read.csv)
#merge all elements of "csvs" into one data table
merged.sheet = Reduce(function(...) merge(..., all=T), csvs)
#export table as csv
write.csv(merged.sheet, "path")
The result has all the data I want and it has added each column only once, as I hoped (although the order of columns is odd and I don't know how to sort it the way I want, plus R has added an X to each of the column names for some reason). However, it is not compact at all, as it has just put one row below the other, even when those could have been combined because the identifying values (date, category, etc.) were the same and the observations in different columns.
I've played around with it quite a bit and googled extensively, but nothing has worked so far. For example, I've tried setkey before merging but was given an error because what I read in initially was a list, not a dataframe; I've tried various melt functions, but they all either returned an error (when I specified the ID variables, which R told me it couldn't find in the data even though I had explicitly copied them over) or didn't identify my numbered columns as IDs and omitted quite a bit of data (when I didn't specify the ID variables). I also tried passing arguments to the merge function, for example by="a", but that did not give me the result I wanted. I tried by="a", by.x="b", and by.y="c", which returned an error message (something about the length of the argument not being correct). Passing several arguments to by= also returned an error (because only one unique column name is allowed).
I'm new to R and can't think of anything else to try. Any help would be much appreciated!
EDIT: I've created some sample data that should illustrate what my dataset looks like. The sample data consists of 5 files.
EDIT2: here are the structures of the 5 sample files:
dput(File1)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"),
Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA,
34343L, NA, NA, 360000000L, 1000000000L, 13500000L), X2.2 = c(NA,
NA, NA, NA, NA, NA, 520000000L, 270000000L, 178L), X3.1 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(540000, 3.02e+08,
150, NA, NA, NA, 11111111, 2323232, 102)), .Names = c("ReportDate",
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA,
-9L))
> dput(File2)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"),
Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA,
76000L, NA, NA, 13000000L, 13000000L, 24000L), X2.2 = c(NA,
NA, NA, NA, NA, NA, 90909090L, 325500L, 198000L), X3.1 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(1.6e+10, 2434340000,
2.8e+10, NA, NA, NA, 500, 21000, 6.5e+10)), .Names = c("ReportDate",
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA,
-9L))
> dput(File3)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "01/12/2016", class = "factor"),
RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"),
RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e"), class = "factor"),
Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),
Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz"), class = "factor"),
X3.7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 7000000L, 650404040L), X3.8 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA,
NA, NA, NA, NA, NA, 123456, 1.7e+11, NA, NA, 50004444, 50004444,
1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA,
NA, 1.7e+10, 2.8005e+10, NA, NA, 3e+09, 3e+09, 4e+09, 4e+09,
3.5e+09, 3.5e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.3434e+10,
4.3434e+10, NA, NA, 3870015600, 3762897490, 54545454, 7006666,
9.3e+11, 7675030303)), .Names = c("ReportDate", "RL", "RLI",
"Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", "X3.11",
"X3.12"), class = "data.frame", row.names = c(NA, -16L))
> dput(File4)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"),
RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"),
RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L,
4L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e", "f"), class = "factor"),
Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),
Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz", "yx"
), class = "factor"), X3.7 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, 1900000L, 630404040L), X3.8 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), X3.9 = c(NA, NA, NA, NA, NA, NA, 503456, 1.27e+11, NA,
NA, 51004444, 51004444, 1200000, 1200000, NA, NA), X3.11 = c(NA,
NA, NA, NA, NA, NA, 1.6e+10, 1.3005e+10, NA, NA, 3e+09, 4.3e+09,
4e+09, 4e+09, 2.8e+09, 2.8e+09), X3.12 = c(NA, NA, NA, NA,
NA, NA, 4.4434e+10, 4.4434e+10, NA, NA, 4070015600, 3762897490,
54545454, 8006666, 9.3e+10, 7585030303)), .Names = c("ReportDate",
"RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9",
"X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L
))
> dput(File5)
structure(list(ReportDate = structure(c(1L, 1L, 1L), .Label = "30/10/2016", class = "factor"),
RL = structure(c(1L, 1L, 1L), .Label = "Service2", class = "factor"),
RLI = structure(c(1L, 1L, 1L), .Label = "cd", class = "factor"),
Identifier1 = structure(c(2L, 1L, 2L), .Label = c("h", "j"
), class = "factor"), Identifier2 = structure(c(1L, 2L, 2L
), .Label = c("xz", "yx"), class = "factor"), X5.1 = c(656565L,
2340808L, NA), X5.2 = c(104L, NA, NA), X5.4 = c(64343L, NA,
NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier1",
"Identifier2", "X5.1", "X5.2", "X5.4"), class = "data.frame", row.names = c(NA,
-3L))
The actual dataset looks more or less like this, except that the variables have different names and observations, there are hundreds of files, and there are more of the numbered variables (the Xs before the variables were added by R, they're not in the original csv files). (Also, here in the sample data I've only added numbers and empty cells, but some of the observations in the original dataset are in characters.)
I want to merge these files into one in a way that makes the result as compact as possible. For example, observation 1 in File 5 should be on the same row as observation 1 in File 4, because Date, RL, RLI, and Identifiers 1 and 2 are the same for both, and the observations are in different columns. But if the date or one of the other identifiers were different then they should be on separate rows.
My three main attempts so far were as follows:
# packages
library(data.table)
library(openxlsx)
library(plyr)
library(reshape)
library(dplyr)
library(tidyverse)
library(purrr)
##attempt 1
#make a list of files in a folder and label it "allFiles"
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
allFiles <- lapply(allFiles, read.csv)
#merge all elements of "allFiles" into one datatable
merged.sheet = Reduce(function(...) merge(..., all=T), allFiles)
This was the best method so far. The structure of merged.sheet is
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("30/10/2016",
"01/12/2016"), class = "factor"), RL = structure(c(1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Service 1",
"Service 2", "Service2"), class = "factor"), RLI = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L,
4L), .Label = c("ab", "cd", "f", "e"), class = "factor"), Identifier2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L), .Label = c("xy", "xz", "yx"), class = "factor"), Identifier1 = structure(c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L), .Label = c("h", "j"), class = "factor"), X3.7 = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 630404040L, NA, NA, 1900000L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
650404040L, NA, 7000000L, NA), X3.8 = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 503456, NA, NA,
1.27e+11, NA, NA, 51004444, NA, 1200000, 51004444, NA, 1200000,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 50004444, NA,
NA, 1.7e+11, 123456, NA, NA, NA, 50004444, NA, 1200000, NA, 1200000
), X3.11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 1.6e+10, NA, NA, 1.3005e+10, NA, NA, 3e+09, 2.8e+09, 4e+09,
4.3e+09, 2.8e+09, 4e+09, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 3e+09, NA, NA, 2.8005e+10, 1.7e+10, NA, NA, NA, 3e+09,
3.5e+09, 4e+09, 3.5e+09, 4e+09), X3.12 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 4.4434e+10, NA, NA, 4.4434e+10,
NA, NA, 4070015600, 7585030303, 8006666, 3762897490, 9.3e+10,
54545454, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3870015600,
NA, NA, 4.3434e+10, 4.3434e+10, NA, NA, NA, 3762897490, 7675030303,
7006666, 9.3e+11, 54545454), X2.1 = c(34343L, NA, NA, NA, NA,
NA, 360000000L, 1000000000L, 13500000L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 76000L,
NA, 13000000L, 13000000L, 24000L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), X2.2 = c(NA, NA, NA, NA,
NA, NA, 520000000L, 270000000L, 178L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
325500L, 90909090L, 198000L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), X3.1 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(NA,
150, 540000, 3.02e+08, NA, NA, 11111111, 2323232, 102, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2434340000,
1.6e+10, 2.8e+10, NA, NA, NA, 21000, 500, 6.5e+10, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.1 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 656565L, 656565L,
656565L, 2340808L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), X5.2 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 104L, 104L, 104L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.4 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 64343L, 64343L,
64343L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier2",
"Identifier1", "X3.7", "X3.8", "X3.9", "X3.11", "X3.12", "X2.1",
"X2.2", "X3.1", "X3.5", "X5.1", "X5.2", "X5.4"), row.names = c(NA,
-50L), class = "data.frame")
The main issue is that the database is not very compact. Everything that's on a separate row in the original file is on a separate row in the merged sheet, even where all the identifiers are identical. This makes the dataset extremely large when it's done with the actual data, and most of the cells are NAs.
##attempt 2
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
allFiles <- lapply(allFiles, read.csv)
#df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2"))
df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI"))
This didn't work. I don't know what went wrong exactly, but most of the data is missing from df and I don't know what to make of the structure that resulted:
structure(list(ReportDate = c("30/10/2016", "30/10/2016", "30/10/2016",
"30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016",
"30/10/2016"), RL = c("Service 1", "Service 1", "Service 1",
"Service 1", "Service 1", "Service 2", "Service 2", "Service 2",
"Service 2"), RLI = c("ab", "ab", "ab", "ab", "ab", "ab", "cd",
"cd", "f"), Identifier2.x = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1.x = c(NA,
NA, NA, 34343L, NA, NA, 360000000L, 1000000000L, 13500000L),
X2.2.x = c(NA, NA, NA, NA, NA, NA, 520000000L, 270000000L,
178L), X3.1.x = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.x = c(540000,
3.02e+08, 150, NA, NA, NA, 11111111, 2323232, 102), Identifier2.y = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = "xy", class = "factor"),
X2.1.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X2.2.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X3.1.y = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.y = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Identifier1.x = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
), class = "factor"), Identifier2.x.x = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("xy",
"xz"), class = "factor"), X3.7.x = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X3.8.x = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA), X3.9.x = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), X3.11.x = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), X3.12.x = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Identifier1.y = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
), class = "factor"), Identifier2.y.y = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("xy",
"xz", "yx"), class = "factor"), X3.7.y = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X3.8.y = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA), X3.9.y = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), X3.11.y = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), X3.12.y = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Identifier1 = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
), class = "factor"), Identifier2 = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = c("xz",
"yx"), class = "factor"), X5.1 = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X5.2 = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), X5.4 = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), class = "data.frame", row.names = c(NA,
-9L), .Names = c("ReportDate", "RL", "RLI", "Identifier2.x",
"X2.1.x", "X2.2.x", "X3.1.x", "X3.5.x", "Identifier2.y", "X2.1.y",
"X2.2.y", "X3.1.y", "X3.5.y", "Identifier1.x", "Identifier2.x.x",
"X3.7.x", "X3.8.x", "X3.9.x", "X3.11.x", "X3.12.x", "Identifier1.y",
"Identifier2.y.y", "X3.7.y", "X3.8.y", "X3.9.y", "X3.11.y", "X3.12.y",
"Identifier1", "Identifier2", "X5.1", "X5.2", "X5.4"))
Here's my last attempt:
##attempt 3 (Table to be read in is an Excel spreadsheet with the variable names ReportDate and so on as the top row.)
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
for(i in 1:length(allFiles)) {
dt <- read.csv(allFiles[i])
dt1 <- as.data.table(dt)
#set keys
setkey(Table1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
setkey(dt1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
NewTable <- merge(Table1, dt1, all=TRUE)
return(NewTable)
rm(dt1)
}
Attempt 3 returned the following error message:
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
some columns are not in the data.table: Identifier1
This is a problem in and of itself. Some of the data files I am trying to merge contain Identifier1, some contain Identifier2, and some contain both. I need those to be used as keys so that data will only be merged onto one row if the same Identifier(s) is/are present in the files in question AND contain the same values for the observations which should go on the same row. However, my function only permits keys which are present in all the files to be merged, it would appear. However, for the moment I re-ran the code without these keys:
##attempt 4
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)
pathName <- "path"
allFiles <- list.files(pathName, full.names = T)
for(i in 1:length(allFiles)) {
dt <- read.csv(allFiles[i])
dt1 <- as.data.table(dt)
#set keys
setkey(Table1, "ReportDate", "RL", "RLI")
setkey(dt1, "ReportDate", "RL", "RLI")
NewTable <- merge(Table1, dt1, all=TRUE)
return(NewTable)
rm(dt1)
}
This returned another error:
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
x.'ReportDate' is a factor column being joined to i.'ReportDate' which is type 'logical'. Factor columns must join to factor or character columns.
EDIT3: the structure of Table1 from attempts 3 and 4 is as follows:
dput(Table1)
structure(list(ReportDate = logical(0), RL = logical(0), RLI = logical(0),
Identifier1 = logical(0), Identifier2 = logical(0), `2.1` = logical(0),
`2.2000000000000002` = logical(0), `3.1` = logical(0), `3.5` = logical(0),
`3.7` = logical(0), `3.8` = logical(0), `3.9` = logical(0),
`3.11` = logical(0), `3.12` = logical(0), `5.0999999999999996` = logical(0),
`5.2` = logical(0), `5.4` = logical(0)), .Names = c("ReportDate",
"RL", "RLI", "Identifier1", "Identifier2", "2.1", "2.2000000000000002",
"3.1", "3.5", "3.7", "3.8", "3.9", "3.11", "3.12", "5.0999999999999996",
"5.2", "5.4"), row.names = integer(0), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x00000000000b0788>, sorted = c("ReportDate",
"RL", "RLI"))