0

Apologies in advance if this has already been asked/answered elsewhere, but I've looked at quite a number of threads and nothing's worked so far.

I'm working on combining a large number of data files into one database. Each data file has the same first four (sometimes five) column names which are words/characters, but after that the columns names are labelled with numbers, and are different (they'll be the same for a set of files, respectively, but different for a different set of files). So just to give an example, say file 1 has columns a, b, c, d, 1.1, 1.2, 2.3, file 2 has columns a, b, c, d, 1.3, 1.4, 2.1, and file 3 has columns a, b, c, e, 3.2, 5.1.

Each file then contains a varying number of observations (the first column is always the report date). Some observations are numerical and others are characters. I want to read all the files in at once and combine them into a data frame such that (1) the columns which are shared with other files are combined into one, (2) columns which are different are automatically added, and (3) for those observation where the values observed in the first four/five columns, ie report date and similar specifications, are all identical, observations are entered on the same row. For example, if file 1 and 2 are identical on columns a, b, c, and d, but file 1 has observations in columns 1.1, 1.2, 2.3, with the rest missing and file 2 has observations in columns a, b, c, d, 1.3, 1.4, 2.1 and no others, I want those observations to just be added on the same row. (So far the best I've been able to do is to have a separate row for each row from the original files, which leads to my end result consisting mostly of NAs/empty cells, and not being very compact or usable.)

I have a large number of files, each of which is a different length, and I want to read them in and merge them all at once using a loop. What I've managed so far is as follows:

# packages
library(data.table)
library(plyr)
library(reshape)
library(dplyr)

#make a list of files in a folder and label it "filenames"
filenames <- list.files("path", full.names = T)

#read each element in "filenames" into R and label the resulting list "csvs"
csvs <- lapply(filenames, read.csv)

#merge all elements of "csvs" into one data table
merged.sheet = Reduce(function(...) merge(..., all=T), csvs)

#export table as csv
write.csv(merged.sheet, "path")

The result has all the data I want and it has added each column only once, as I hoped (although the order of columns is odd and I don't know how to sort it the way I want, plus R has added an X to each of the column names for some reason). However, it is not compact at all, as it has just put one row below the other, even when those could have been combined because the identifying values (date, category, etc.) were the same and the observations in different columns.

I've played around with it quite a bit and googled extensively, but nothing has worked so far. For example, I've tried setkey before merging but was given an error because what I read in initially was a list, not a dataframe; I've tried various melt functions, but they all either returned an error (when I specified the ID variables, which R told me it couldn't find in the data even though I had explicitly copied them over) or didn't identify my numbered columns as IDs and omitted quite a bit of data (when I didn't specify the ID variables). I also tried passing arguments to the merge function, for example by="a", but that did not give me the result I wanted. I tried by="a", by.x="b", and by.y="c", which returned an error message (something about the length of the argument not being correct). Passing several arguments to by= also returned an error (because only one unique column name is allowed).

I'm new to R and can't think of anything else to try. Any help would be much appreciated!

EDIT: I've created some sample data that should illustrate what my dataset looks like. The sample data consists of 5 files.

EDIT2: here are the structures of the 5 sample files:

dput(File1)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), 
    Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 
    34343L, NA, NA, 360000000L, 1000000000L, 13500000L), X2.2 = c(NA, 
    NA, NA, NA, NA, NA, 520000000L, 270000000L, 178L), X3.1 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(540000, 3.02e+08, 
    150, NA, NA, NA, 11111111, 2323232, 102)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, 
-9L))
> dput(File2)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"
), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), 
    Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 
    76000L, NA, NA, 13000000L, 13000000L, 24000L), X2.2 = c(NA, 
    NA, NA, NA, NA, NA, 90909090L, 325500L, 198000L), X3.1 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(1.6e+10, 2434340000, 
    2.8e+10, NA, NA, NA, 500, 21000, 6.5e+10)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, 
-9L))
> dput(File3)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), 
    RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), 
    RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e"), class = "factor"), 
    Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"), 
    Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz"), class = "factor"), 
    X3.7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, 7000000L, 650404040L), X3.8 = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, 
    NA, NA, NA, NA, NA, 123456, 1.7e+11, NA, NA, 50004444, 50004444, 
    1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA, 
    NA, 1.7e+10, 2.8005e+10, NA, NA, 3e+09, 3e+09, 4e+09, 4e+09, 
    3.5e+09, 3.5e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.3434e+10, 
    4.3434e+10, NA, NA, 3870015600, 3762897490, 54545454, 7006666, 
    9.3e+11, 7675030303)), .Names = c("ReportDate", "RL", "RLI", 
"Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", "X3.11", 
"X3.12"), class = "data.frame", row.names = c(NA, -16L))
> dput(File4)
structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), 
    RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), 
    RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 
    4L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e", "f"), class = "factor"), 
    Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"), 
    Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
    3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz", "yx"
    ), class = "factor"), X3.7 = c(NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, 1900000L, 630404040L), X3.8 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
    ), X3.9 = c(NA, NA, NA, NA, NA, NA, 503456, 1.27e+11, NA, 
    NA, 51004444, 51004444, 1200000, 1200000, NA, NA), X3.11 = c(NA, 
    NA, NA, NA, NA, NA, 1.6e+10, 1.3005e+10, NA, NA, 3e+09, 4.3e+09, 
    4e+09, 4e+09, 2.8e+09, 2.8e+09), X3.12 = c(NA, NA, NA, NA, 
    NA, NA, 4.4434e+10, 4.4434e+10, NA, NA, 4070015600, 3762897490, 
    54545454, 8006666, 9.3e+10, 7585030303)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", 
"X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L
))
> dput(File5)
structure(list(ReportDate = structure(c(1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), 
    RL = structure(c(1L, 1L, 1L), .Label = "Service2", class = "factor"), 
    RLI = structure(c(1L, 1L, 1L), .Label = "cd", class = "factor"), 
    Identifier1 = structure(c(2L, 1L, 2L), .Label = c("h", "j"
    ), class = "factor"), Identifier2 = structure(c(1L, 2L, 2L
    ), .Label = c("xz", "yx"), class = "factor"), X5.1 = c(656565L, 
    2340808L, NA), X5.2 = c(104L, NA, NA), X5.4 = c(64343L, NA, 
    NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", 
"Identifier2", "X5.1", "X5.2", "X5.4"), class = "data.frame", row.names = c(NA, 
-3L))

The actual dataset looks more or less like this, except that the variables have different names and observations, there are hundreds of files, and there are more of the numbered variables (the Xs before the variables were added by R, they're not in the original csv files). (Also, here in the sample data I've only added numbers and empty cells, but some of the observations in the original dataset are in characters.)

I want to merge these files into one in a way that makes the result as compact as possible. For example, observation 1 in File 5 should be on the same row as observation 1 in File 4, because Date, RL, RLI, and Identifiers 1 and 2 are the same for both, and the observations are in different columns. But if the date or one of the other identifiers were different then they should be on separate rows.

My three main attempts so far were as follows:

# packages
library(data.table)
library(openxlsx)
library(plyr)
library(reshape)
library(dplyr)
library(tidyverse)
library(purrr)

##attempt 1

#make a list of files in a folder and label it "allFiles"

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 
allFiles <- lapply(allFiles, read.csv)

#merge all elements of "allFiles" into one datatable
merged.sheet = Reduce(function(...) merge(..., all=T), allFiles)

This was the best method so far. The structure of merged.sheet is

structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("30/10/2016", 
"01/12/2016"), class = "factor"), RL = structure(c(1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Service 1", 
"Service 2", "Service2"), class = "factor"), RLI = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
3L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 
4L), .Label = c("ab", "cd", "f", "e"), class = "factor"), Identifier2 = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 
1L), .Label = c("xy", "xz", "yx"), class = "factor"), Identifier1 = structure(c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 
2L), .Label = c("h", "j"), class = "factor"), X3.7 = c(NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 630404040L, NA, NA, 1900000L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
650404040L, NA, 7000000L, NA), X3.8 = c(NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 503456, NA, NA, 
1.27e+11, NA, NA, 51004444, NA, 1200000, 51004444, NA, 1200000, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 50004444, NA, 
NA, 1.7e+11, 123456, NA, NA, NA, 50004444, NA, 1200000, NA, 1200000
), X3.11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 1.6e+10, NA, NA, 1.3005e+10, NA, NA, 3e+09, 2.8e+09, 4e+09, 
4.3e+09, 2.8e+09, 4e+09, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 3e+09, NA, NA, 2.8005e+10, 1.7e+10, NA, NA, NA, 3e+09, 
3.5e+09, 4e+09, 3.5e+09, 4e+09), X3.12 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 4.4434e+10, NA, NA, 4.4434e+10, 
NA, NA, 4070015600, 7585030303, 8006666, 3762897490, 9.3e+10, 
54545454, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3870015600, 
NA, NA, 4.3434e+10, 4.3434e+10, NA, NA, NA, 3762897490, 7675030303, 
7006666, 9.3e+11, 54545454), X2.1 = c(34343L, NA, NA, NA, NA, 
NA, 360000000L, 1000000000L, 13500000L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 76000L, 
NA, 13000000L, 13000000L, 24000L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), X2.2 = c(NA, NA, NA, NA, 
NA, NA, 520000000L, 270000000L, 178L, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
325500L, 90909090L, 198000L, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), X3.1 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(NA, 
150, 540000, 3.02e+08, NA, NA, 11111111, 2323232, 102, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2434340000, 
1.6e+10, 2.8e+10, NA, NA, NA, 21000, 500, 6.5e+10, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.1 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 656565L, 656565L, 
656565L, 2340808L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), X5.2 = c(NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 104L, 104L, 104L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X5.4 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 64343L, 64343L, 
64343L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier2", 
"Identifier1", "X3.7", "X3.8", "X3.9", "X3.11", "X3.12", "X2.1", 
"X2.2", "X3.1", "X3.5", "X5.1", "X5.2", "X5.4"), row.names = c(NA, 
-50L), class = "data.frame")

The main issue is that the database is not very compact. Everything that's on a separate row in the original file is on a separate row in the merged sheet, even where all the identifiers are identical. This makes the dataset extremely large when it's done with the actual data, and most of the cells are NAs.

##attempt 2

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 
allFiles <- lapply(allFiles, read.csv)

#df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2"))
df <- allFiles %>% purrr::reduce(dplyr::left_join, by = c("ReportDate", "RL", "RLI"))

This didn't work. I don't know what went wrong exactly, but most of the data is missing from df and I don't know what to make of the structure that resulted:

structure(list(ReportDate = c("30/10/2016", "30/10/2016", "30/10/2016", 
"30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016", "30/10/2016", 
"30/10/2016"), RL = c("Service 1", "Service 1", "Service 1", 
"Service 1", "Service 1", "Service 2", "Service 2", "Service 2", 
"Service 2"), RLI = c("ab", "ab", "ab", "ab", "ab", "ab", "cd", 
"cd", "f"), Identifier2.x = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1.x = c(NA, 
NA, NA, 34343L, NA, NA, 360000000L, 1000000000L, 13500000L), 
    X2.2.x = c(NA, NA, NA, NA, NA, NA, 520000000L, 270000000L, 
    178L), X3.1.x = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.x = c(540000, 
    3.02e+08, 150, NA, NA, NA, 11111111, 2323232, 102), Identifier2.y = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = "xy", class = "factor"), 
    X2.1.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), X2.2.y = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), X3.1.y = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.5.y = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Identifier1.x = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
    ), class = "factor"), Identifier2.x.x = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("xy", 
    "xz"), class = "factor"), X3.7.x = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X3.8.x = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA), X3.9.x = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
    ), X3.11.x = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), X3.12.x = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Identifier1.y = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
    ), class = "factor"), Identifier2.y.y = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("xy", 
    "xz", "yx"), class = "factor"), X3.7.y = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X3.8.y = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA), X3.9.y = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
    ), X3.11.y = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), X3.12.y = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Identifier1 = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("h", "j"
    ), class = "factor"), Identifier2 = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("xz", 
    "yx"), class = "factor"), X5.1 = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X5.2 = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_), X5.4 = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_)), class = "data.frame", row.names = c(NA, 
-9L), .Names = c("ReportDate", "RL", "RLI", "Identifier2.x", 
"X2.1.x", "X2.2.x", "X3.1.x", "X3.5.x", "Identifier2.y", "X2.1.y", 
"X2.2.y", "X3.1.y", "X3.5.y", "Identifier1.x", "Identifier2.x.x", 
"X3.7.x", "X3.8.x", "X3.9.x", "X3.11.x", "X3.12.x", "Identifier1.y", 
"Identifier2.y.y", "X3.7.y", "X3.8.y", "X3.9.y", "X3.11.y", "X3.12.y", 
"Identifier1", "Identifier2", "X5.1", "X5.2", "X5.4"))

Here's my last attempt:

##attempt 3 (Table to be read in is an Excel spreadsheet with the variable names ReportDate and so on as the top row.)
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 

for(i in 1:length(allFiles)) {

  dt <- read.csv(allFiles[i])
  dt1 <- as.data.table(dt)

  #set keys
  setkey(Table1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
  setkey(dt1, "ReportDate", "RL", "RLI", "Identifier1", "Identifier2")
  NewTable <- merge(Table1, dt1, all=TRUE)
  return(NewTable)
  rm(dt1)
}

Attempt 3 returned the following error message:

Error in setkeyv(x, cols, verbose = verbose, physical = physical) : 
  some columns are not in the data.table: Identifier1 

This is a problem in and of itself. Some of the data files I am trying to merge contain Identifier1, some contain Identifier2, and some contain both. I need those to be used as keys so that data will only be merged onto one row if the same Identifier(s) is/are present in the files in question AND contain the same values for the observations which should go on the same row. However, my function only permits keys which are present in all the files to be merged, it would appear. However, for the moment I re-ran the code without these keys:

##attempt 4
Table <- read.xlsx("path")
Table1 <- as.data.table(Table)

pathName <- "path"
allFiles <- list.files(pathName, full.names = T) 

for(i in 1:length(allFiles)) {

  dt <- read.csv(allFiles[i])
  dt1 <- as.data.table(dt)

  #set keys
  setkey(Table1, "ReportDate", "RL", "RLI")
  setkey(dt1, "ReportDate", "RL", "RLI")
  NewTable <- merge(Table1, dt1, all=TRUE)
  return(NewTable)
  rm(dt1)
}

This returned another error:

 Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'ReportDate' is a factor column being joined to i.'ReportDate' which is type 'logical'. Factor columns must join to factor or character columns. 

EDIT3: the structure of Table1 from attempts 3 and 4 is as follows:

dput(Table1)
structure(list(ReportDate = logical(0), RL = logical(0), RLI = logical(0), 
    Identifier1 = logical(0), Identifier2 = logical(0), `2.1` = logical(0), 
    `2.2000000000000002` = logical(0), `3.1` = logical(0), `3.5` = logical(0), 
    `3.7` = logical(0), `3.8` = logical(0), `3.9` = logical(0), 
    `3.11` = logical(0), `3.12` = logical(0), `5.0999999999999996` = logical(0), 
    `5.2` = logical(0), `5.4` = logical(0)), .Names = c("ReportDate", 
"RL", "RLI", "Identifier1", "Identifier2", "2.1", "2.2000000000000002", 
"3.1", "3.5", "3.7", "3.8", "3.9", "3.11", "3.12", "5.0999999999999996", 
"5.2", "5.4"), row.names = integer(0), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x00000000000b0788>, sorted = c("ReportDate", 
"RL", "RLI"))
Pia
  • 77
  • 7
  • 2
    We really badly need an example of input and output :) – moodymudskipper Jul 18 '17 at 14:31
  • Also explain what you mean by "combine" or "entered on the same row", you want to concatenate strings ? – moodymudskipper Jul 18 '17 at 14:34
  • Agreed, can you `dput` your `csvs` object so we can see how your data looks? See the first answer here on how to share a minimal reproducible dataset in a question: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – meenaparam Jul 18 '17 at 14:36
  • If you can't share your data, then assuming `csvs` is a list of dataframes, you could use the `reduce` function from the `purrr` package, alongside `left_join` from `dplyr`, with the identical columns in your different dataframes as the joining variables see e.g. here https://daranzolin.github.io/articles/2016-12/join-list-dataframes But if you share your data, we can help you more and show you how to do this. – meenaparam Jul 18 '17 at 14:39
  • Hi there, thank you very much for your comments and help! Unfortunately I can't share my data for confidentiality reasons. But csvs is essentially just what I named the list of files I read into R. I listed all the files in a specific folder, and then read them all in using lapply and read.csv. The result was a list of dataframes which I named csvs. I will try the method you linked, meenaparam, and let you know if it worked! – Pia Jul 18 '17 at 15:03
  • Hi @meenaparam, I tried running the code from the link you shared, but somehow R returned an error message - I don't know which part I got wrong. So based on the instructions, I ran the following code: `pathName <- "path" allFiles <- list.files(pathName, full.names = T) allFiles <- lapply(allFiles, read.csv) create_df <- function(x) { df <- data.frame(allFiles) return(df) } allFiles %>% map(create_df) %>% reduce(left_join, by = "ReportDate")` – Pia Jul 18 '17 at 16:06
  • The error message I got was `Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 6, 10, 4, 7, 24, 36, 8, 14, 2, 1, 16, 18, 70, 53, 44, 38, 45, 28, 64, 76, 5, 19, 3` (Also, I’d like to pass more column names to “by”, but I don’t know if or how that’ll work even if I hadn’t already encountered an issue.) – Pia Jul 18 '17 at 16:07
  • @P-K Ah, ok, if you can't share your data, can you make up some similar data? It looks like your error is actually coming from your `create_df` function. Is your `df` object being created? You don't need to have that `create_df` function. You should just have all your dataframes in a list and then use `map` and `reduce`. It will be something like `my_final_df <- my_list_of_dfs %>% purrr::reduce(dplyr::left_join, by = c("joining_var_1", "joining_var_2"))`. – meenaparam Jul 18 '17 at 16:14
  • @meenaparam, I tried re-running the code without the create_df function, but the result was missing most of my data and looked really strange. I'm not sure how to use map. I've made up some sample data and edited my original post, and also added the three methods I've tried with results and/or error messages returned. – Pia Jul 19 '17 at 09:43
  • @P-K, Thanks for all the extra info, it's great to see what you have tried and what hasn't worked so far. Ok, I will try and `read.table` your five made up files. For future info, if you `dput` your data structures, it makes it much easier for people to help, as then they can read your data in immediately. – meenaparam Jul 19 '17 at 09:49
  • @P-K Actually, can you `dput` the structure after? `read.table` is throwing up lots of errors about columns and column names, probably because of the copying and pasting from your SO post to my R console. Once we have the data structures, we can try and solve your problem. – meenaparam Jul 19 '17 at 09:57
  • `> dput(Table1) structure(list(ReportDate = logical(0), RL = logical(0), RLI = logical(0), Identifier1 = logical(0), Identifier2 = logical(0), `2.1` = logical(0), `2.2000000000000002` = logical(0), `3.1` = logical(0), `3.5` = logical(0), `3.7` = logical(0), `3.8` = logical(0), `3.9` = logical(0), `3.11` = logical(0), `3.12` = logical(0), `5.0999999999999996` = logical(0),` to be continued – Pia Jul 19 '17 at 10:15
  • ` `5.2` = logical(0), `5.4` = logical(0)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2", "2.1", "2.2000000000000002", "3.1", "3.5", "3.7", "3.8", "3.9", "3.11", "3.12", "5.0999999999999996", "5.2", "5.4"), row.names = integer(0), class = c("data.table", "data.frame"), .internal.selfref = , sorted = c("ReportDate", "RL", "RLI"))` – Pia Jul 19 '17 at 10:16
  • @meenaparam, I've added the structure of Table1 from attempts 3 and 4 in two comments above (it didn't fit in the original post and was also too long for one comment). – Pia Jul 19 '17 at 10:17
  • @P-K, Ok thanks. If you can, just delete the five sample files you have pasted into your question and replace that space with the five dputs. This will allow other people to help you directly too, rather than having to look through the comments. – meenaparam Jul 19 '17 at 10:24

1 Answers1

0

Ok I think this is what you want. This solution uses dplyr and purrr.

First load in the sample data.

df1 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L,2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"), Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA, 34343L, NA, NA, 360000000L, 1000000000L, 13500000L), X2.2 = c(NA, NA, NA, NA, NA, NA, 520000000L, 270000000L, 178L), X3.1 = c(NA,  NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(540000, 3.02e+08, 150, NA, NA, NA, 11111111, 2323232, 102)), .Names = c("ReportDate", "RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, 
-9L))    
df2 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L,1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Service 1", "Service 2"), class = "factor"), RLI = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L), .Label = c("ab", "cd", "f"), class = "factor"),  Identifier2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,  1L), .Label = "xy", class = "factor"), X2.1 = c(NA, NA, NA,  76000L, NA, NA, 13000000L, 13000000L, 24000L), X2.2 = c(NA,  NA, NA, NA, NA, NA, 90909090L, 325500L, 198000L), X3.1 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA), X3.5 = c(1.6e+10, 2434340000,2.8e+10, NA, NA, NA, 500, 21000, 6.5e+10)), .Names = c("ReportDate","RL", "RLI", "Identifier2", "X2.1", "X2.2", "X3.1", "X3.5"), class = "data.frame", row.names = c(NA, -9L))
df3 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "01/12/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e"), class = "factor"), Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz"), class = "factor"),X3.7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7000000L, 650404040L), X3.8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, NA, NA, NA, NA, NA, 123456, 1.7e+11, NA, NA, 50004444, 50004444,1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA,NA, 1.7e+10, 2.8005e+10, NA, NA, 3e+09, 3e+09, 4e+09, 4e+09, 3.5e+09, 3.5e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.3434e+10, 4.3434e+10, NA, NA, 3870015600, 3762897490, 54545454, 7006666,9.3e+11, 7675030303)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9", "X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L))    
df4 <- structure(list(ReportDate = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 3L), .Label = c("ab", "cd", "e", "f"), class = "factor"), Identifier1 = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("h", "j"), class = "factor"),Identifier2 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("xy", "xz", "yx" ), class = "factor"), X3.7 = c(NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, NA, NA, NA, NA, 1900000L, 630404040L), X3.8 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), X3.9 = c(NA, NA, NA, NA, NA, NA, 503456, 1.27e+11, NA, NA, 51004444, 51004444, 1200000, 1200000, NA, NA), X3.11 = c(NA, NA, NA, NA, NA, NA, 1.6e+10, 1.3005e+10, NA, NA, 3e+09, 4.3e+09, 4e+09, 4e+09, 2.8e+09, 2.8e+09), X3.12 = c(NA, NA, NA, NA, NA, NA, 4.4434e+10, 4.4434e+10, NA, NA, 4070015600, 3762897490, 54545454, 8006666, 9.3e+10, 7585030303)), .Names = c("ReportDate", "RL", "RLI", "Identifier1", "Identifier2", "X3.7", "X3.8", "X3.9","X3.11", "X3.12"), class = "data.frame", row.names = c(NA, -16L))    
df5 <- structure(list(ReportDate = structure(c(1L, 1L, 1L), .Label = "30/10/2016", class = "factor"), RL = structure(c(1L, 1L, 1L), .Label = "Service2", class = "factor"), RLI = structure(c(1L, 1L, 1L), .Label = "cd", class = "factor"), Identifier1 = structure(c(2L, 1L, 2L), .Label = c("h", "j"), class = "factor"), Identifier2 = structure(c(1L, 2L, 2L), .Label = c("xz", "yx"), class = "factor"), X5.1 = c(656565L, 2340808L, NA), X5.2 = c(104L, NA, NA), X5.4 = c(64343L, NA, NA)), .Names = c("ReportDate", "RL", "RLI", "Identifier1","Identifier2", "X5.1", "X5.2", "X5.4"), class = "data.frame", row.names = c(NA, -3L))

Then load the libraries and put all the five dataframes into a list.

library(dplyr)
library(purrr)

dfs <- list("file1" = df1, "file2" = df2, "file3" = df3, "file4" = df4, "file5" = df5)

Now make a vector of the variable names which you ultimately want to join on.

shared_vars <- names(dfs$file5[1:5])

Because the five dataframes do not all have the same columns, and some missing columns e.g. Identifier1 are needed for joining, write a function that creates these missing columns and fills them with NAs where they don't already exist (fill missing columns adapted from here, with help on column type conversion here).

# function to create missing columns of joining variables where they don't already exist in a dataframe
make_missing_cols <- function(varnames, df) {
            if (sum(!varnames %in% names(df)) != 0) {
                new_df <- data.frame(df, setNames(as.list(rep(NA, sum(!varnames %in% names(df)))), setdiff(varnames, names(df))))
                # convert any new columns to factor (this will also change other logical columns to factors)
                new_df[sapply(new_df, is.logical)] <- lapply(new_df[sapply(new_df, is.logical)], as.factor)
                new_df[ ,order(colnames(new_df))] 
            } else {            
                new_df <- df[ , order(colnames(df))]
            }
}

Now apply the make_missing_cols function to each of the five dfs in the list to make a new list of five dfs, each now with all the same columns.

dfs_allcols <- 
    dfs %>% 
    map(~ make_missing_cols(varnames = shared_vars, df = .))

Finally, join the five dfs into a single df. Not specifying any by argument to full_join makes dplyr do the join on all variables with common names across the five dataframes. arrange just sorts outdf on the specified columns. distinct keeps the unique rows only.

outdf <- 
    dfs_allcols %>% 
    reduce(full_join) %>% 
    arrange(ReportDate, RL, RLI, Identifier1, Identifier2) %>%
    distinct

A snapshot of outdf:

# A tibble: 43 x 17
   Identifier1 Identifier2 ReportDate        RL   RLI     X2.1     X2.2   X3.1        X3.5 X3.11 X3.12
         <chr>       <chr>      <chr>     <chr> <chr>    <int>    <int> <fctr>       <dbl> <dbl> <dbl>
 1        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA 16000000000    NA    NA
 2        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA  2434340000    NA    NA
 3        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA 28000000000    NA    NA
 4        <NA>          xy 01/12/2016 Service 1    ab    76000       NA     NA          NA    NA    NA
 5        <NA>          xy 01/12/2016 Service 1    ab       NA       NA     NA          NA    NA    NA
 6        <NA>          xy 01/12/2016 Service 2    ab       NA       NA     NA          NA    NA    NA
 7        <NA>          xy 01/12/2016 Service 2    cd 13000000 90909090     NA         500    NA    NA
 8        <NA>          xy 01/12/2016 Service 2    cd 13000000   325500     NA       21000    NA    NA
 9        <NA>          xy 01/12/2016 Service 2     f    24000   198000     NA 65000000000    NA    NA
10           h          xz 01/12/2016  Service2    ab       NA       NA     NA          NA    NA    NA
# ... with 33 more rows, and 6 more variables: X3.7 <int>, X3.8 <lgl>, X3.9 <dbl>, X5.1 <int>, X5.2 <int>,
#   X5.4 <int>

Note that you may need to do some tinkering on outdf after this step to get variables to the correct column types, especially since the make_missing_cols function converts any logical columns to factor class (for joining purposes).

meenaparam
  • 1,949
  • 2
  • 17
  • 29
  • Thank you very much, I'll try that out just now! However, is it possible for rows to be merged into one row when the values contained in the first five rows are identical? For instance, in the snapshot, rows 3-5 have NA for Identifier1, xy for Identifier2, the same report date, Service 1 for RL, and ab for RLI. Row 4 has NAs where row 5 contains observations, and vice versa. So those should ideally be on the same row. – Pia Jul 20 '17 at 08:05
  • Yep, that is easy to do with the `outdf` object once it is created. Just run `dplyr::distinct` on the dataframe to return only unique rows. https://rdrr.io/cran/dplyr/man/distinct.html – meenaparam Jul 20 '17 at 11:16
  • I've added in a line to my answer to select those unique rows only. – meenaparam Jul 20 '17 at 11:22