3

I have an Excel file with a sheet for each week in my data set. Each sheet has the same number of rows, and each row is identical across the sheets (with the exception of the time period… sheet 1 represents week 1, sheet 2 week 2, etc.). I'm trying to import all the Excel sheets as one data frame in R.

For example, my data is essentially structured like this (with several more columns and sheets):

Week 1 sheet
ID    Gender    DOB    Absences    Lates    Absences_excused
1     M         1997   5           14       5
2     F         1998   4           3        2

Week 2 sheet
ID    Gender    DOB    Absences    Lates    Absences_excused
1     M         1997   2           10       3
2     F         1998   8           2        9

I'm trying to build a script that will take x numbers of sheets and combine them into one data frame like this:

Combined (ideal)
ID    Gender    DOB    Absences.1    Lates.1    Absences.2    Lates.2
1     M         1997   5             14         2             10
2     F         1998   4             3          8             2

I'm using gdata to import the Excel files.

I've tried creating a loop (normally bad for R, I know...) that will go through all the sheets in the Excel file and add each to a list as a data frame:

library(gdata)

number_sheets <- 3
all.sheets <- vector(mode="list", length=number_sheets)

for (i in 1:number_sheets) {
  all.sheets[[i]] <- read.xls("/path/to/file.xlsx", sheet=i)
}

This gives me a nice list, all.sheets, that I can access, but I'm unsure about the best way to create a new data frame from specific columns in the list of data frames.

I've tried the code below, which creates a brand new data frame by looping through the list of data frames. On the first data frame, it saves the columns that are consistent in all the sheets, and then adds the week-specific columns.

Cleaned <- data.frame()
number_sheets <- 3

for (i in 1:number_sheets) {
  if (i == 1) {
    Cleaned <- all.sheets[[i]][,c("ID", "Gender", "DOB")]
  }
  Cleaned$Absences.i <- all.sheets[[i]][,c("Absences")]  # wrong... obviously doesn't work... but essentially what I want
  # Other week-specific columns go here... somehow...
}

This code doesn't work though, since Cleaned$Absences.i is obviously not how you create dynamic columns in a data frame.

What's the best way to combine a set of data frames and create new columns for each of the variables I'm trying to track?

Extra hurdle: I'm also trying to combine two columns, "Absences" and "Absences_excused" into a single "Absences" column in the final data frame, so I'm trying to make my solution let me perform transformations to the new columns, like so (again, this isn't right):

Cleaned$Absences.i <- all.sheets[[i]][,c("Absences")] + all.sheets[[i]][,c("Absences_excused")]  
Andrew
  • 36,541
  • 13
  • 67
  • 93
  • I think you should clarify whether we can expect all sheets to have the same values and order for names == ID, Gender, DOB. If not, then `merge` is clearly the function you want, and if so, then `cbind` will suffice. – IRTFM Mar 04 '12 at 02:21
  • All the sheets have the same values and order for ID, Gender, DOB, and the other constant columns. They don't have the same values for Absences, Absences_excused, etc. – Andrew Mar 04 '12 at 02:23
  • I don't see any column for "excused". – IRTFM Mar 04 '12 at 02:35
  • Oh, yeah, I forgot to mention that the sample sheets above are just snippets of the full data sets. There are a dozen other columns for types of absences, lates, etc., that I'm going to aggregate when combining the sheets into one data frame – Andrew Mar 04 '12 at 02:48
  • 1
    IMHO, the flat data structure you are asking for is not a great design choice, especially if you plan to make changes (re: your *Extra hurdle* section). Instead, you should consider a database format: the result of concatenating your data structures vertically, with an added column for the week number. It will be much easier to work with and you may easily pivot it when you are done: in R using the "reshape" package for example, or in Excel using the pivot feature. – flodel Mar 04 '12 at 03:01
  • @flodel I also considered doing this—and would almost prefer to do this—but I couldn't figure out the best way to add a new week number column to the imported data on the fly. I'm creating an R script that will automatically mangle the data that a client exports from their attendance database, so I can't edit the Excel file before doing anything with it in R… It would be trivial to go and do this all in Excel by hand, but sadly I can't. – Andrew Mar 04 '12 at 03:15
  • 1
    @Andrew, just do `all.sheets[[i]]$week <- i` inside your for loop. Then you can concatenate everything with `do.call(rbind, all.sheets)`. – flodel Mar 04 '12 at 03:24
  • Whoa. I like the `do.call(…)` function. I was just trying to do that with a loop. "No loops in R" should be my mantra :) – Andrew Mar 04 '12 at 03:37

2 Answers2

7

The merge strategy is:

> Week_1_sheet <- read.table(text="ID    Gender    DOB    Absences    Lates
+ 1     M         1997   5           14
+ 2     F         1998   4           3", header=TRUE)
> Week_2_sheet <- read.table(text="ID    Gender    DOB    Absences    Lates
+ 1     M         1997   2           10
+ 2     F         1998   8           2", header=TRUE)
> merge(Week_1_sheet, Week_2_sheet, 1:3)
  ID Gender  DOB Absences.x Lates.x Absences.y Lates.y
1  1      M 1997          5      14          2      10
2  2      F 1998          4       3          8       2

You can rename the columns with names(sheet) <- sub("x", 1, sheet), and again for y -> 2. I think the cbind strategy is OK but merge is probably better to learn.

@TylerRinker raises the question about acceptable arguments to the 'by' parameter. The relevant sentece in the help page is: "Columns can be specified by name, number or by a logical vector: the name "row.names" or the number 0 specifies the row names."

IRTFM
  • 258,963
  • 21
  • 364
  • 487
4

@ DWin I think the poster's problem is a little more complex than the example leads us to believe. I think the poster wants a multi merge as indicated by "week 1, sheet 2 week 2, etc.". My approach is a bit different. The extra hurdle can be taken care of before the merge using lapply with transform. Here's my solution for the merge using 3 data frames instead of 2.

#First read in three data frames
Week_1_sheet <- read.table(text="ID Gender  DOB Absences Unexcused_Absences Lates
1  1      M 1997        5                  1    14
2  2      F 1998        4                  2     3", header=TRUE)

Week_2_sheet <- read.table(text="ID Gender  DOB Absences Unexcused_Absences Lates
1  1      M 1997        2                  1    10
2  2      F 1998        8                  2     2
3  3      M 1998        8                  2     2", header=TRUE)

Week_3_sheet <- read.table(text="ID Gender  DOB Absences Unexcused_Absences Lates
1  1      M 1997        2                  1    10
2  2      F 1998        8                  2     2", header=TRUE)

#Put them into a list structure
WEEKlist <- list(Week_1_sheet , Week_2_sheet , Week_3_sheet)

#Transform to add the absences and unexcused absences and drop unexcused
lapply(seq_along(WEEKlist), function(x) {
    WEEKlist[[x]] <<- transform(WEEKlist[[x]], Absences=sum(Absences,
        Unexcused_Absences))[, -5]
    }
)

#Rename each data frame in the list with `<<-` that acts on environments
lapply(seq_along(WEEKlist), function(x) {
    y <- names(WEEKlist[[x]])
    names(WEEKlist[[x]]) <<- c(y[1:3], paste(y[4:length(y)], ".", x, sep=""))
    }
)

#loop through and merge by the common columns
DF <- WEEKlist[[1]][, 1:3]
for (.df in WEEKlist) { 
     DF <-merge(DF, .df, by=c('ID', 'Gender', 'DOB'), all=TRUE, suffixes=c("", ""))
}

DF

A 2nd approach (after renaming the data frame columns) is to use Reduce: Taken from (LINK)

merge.all <- function(frames, by) {
    return (Reduce(function(x, y) {merge(x, y, by = by, all = TRUE)}, frames))
}

merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB'))

I'm not sure which one is faster though.

EDIT: On a windows 7 machine running 1000 iterations the Reduce was faster:

    test replications elapsed relative user.self sys.self
1   LOOP         1000   10.12  1.62701      7.89        0
2 REDUCE         1000    6.22  1.00000      5.34        0
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519