1

I have to read multiple xlsx file with random names into single dataframe. Structure of each file is same. I have to import specific columns only.

I tried this:

dat <- read.xlsx("FILE.xlsx", sheetIndex=1, 
                  sheetName=NULL, startRow=5, 
                  endRow=NULL, as.data.frame=TRUE, 
                  header=TRUE)

But this is for only one file at a time and I couldn't specify my particular columns. I even tried :

site=list.files(pattern='[.]xls')

but after that loop isn't working. How to do it? Thanks in advance.

Jaap
  • 81,064
  • 34
  • 182
  • 193
user2504063
  • 23
  • 1
  • 2
  • 6

3 Answers3

2

I would read each sheet to a list:

Get file names:

f = list.files("./")

Read files:

dat = lapply(f, function(i){
    x = read.xlsx(i, sheetIndex=1, sheetName=NULL, startRow=5,
        endRow=NULL, as.data.frame=TRUE, header=T)
    # Get the columns you want, e.g. 1, 3, 5
    x = x[, c(1, 3, 5)]
    # You may want to add a column to say which file they're from
    x$file = i
    # Return your data
    x
})

You can then access the items in your list with:

dat[[1]]

Or do the same task to them with:

lapply(dat, colmeans)

Turn them into a data frame (where your file column now becomes useful):

dat = do.call("rbind.data.frame", dat)
MikeRSpencer
  • 1,276
  • 10
  • 24
  • Hey your code did work for me but it worked only for very small size files say 50KB. But I have files which are around 1MB size. What should I do then? – user2504063 Sep 04 '15 at 05:14
  • Try a different excel reading package, this one may be your best bet: http://blog.rstudio.org/2015/04/15/readxl-0-1-0/. Also note that opening many files at the same time is likely to cause you memory problems. – MikeRSpencer Sep 04 '15 at 05:40
  • I'm trying to do something similar but getting 0 obs in the final dataframe ```path = "/K_Data/" l = list.files(path = path, "xlsx") all = lapply(l, function(x){ sup = read_excel(i, sheet="Sup", skip = 2) ID = read_excel(x, sheet="Mea", col_names = FALSE) id = as.character( ID[1,1]) mass = as.numeric(ID[3,5]) sup= sup%>% mutate(ID = id, Mass = mass) }) dat = do.call("rbind.data.frame", all) ``` – Mark Davies Mar 21 '21 at 21:55
  • you need to print/return your data at the end of the lapply function. e.g.: ... sup= sup%>% mutate(ID = id, Mass = mass) sup}) dat = do.call("rbind.data.frame", all) – MikeRSpencer Mar 23 '21 at 19:17
1

I am more familiar with a for loop, which can be a bit more cumbersome.

filelist <- list.files(pattern = "\\.xlsx") # list all the xlsx files from the directory

allxlsx.files <- list()  # create a list to populate with xlsx data (if you wind to bind all the rows together)
count <- 1
for (file in filelist) {
   dat <- read.xlsx(file, sheetIndex=1, 
              sheetName=NULL, startRow=5, 
              endRow=NULL, as.data.frame=TRUE, 
              header=TRUE) [c(5:10, 12,15)] # index your columns of interest
   allxlsx.files[[count]] <-dat # creat a list of rows from xls files
   count <- count + 1
}

convert back to data.frame

allfiles <- do.call(rbind.data.frame, allxlsx.files)
Wyldsoul
  • 1,468
  • 10
  • 16
0

For a variation on Wyldsoul's answer, but using a for loop across multiple Excel sheets (between 1 and j) in the same Excel file, and binding with dplyr:

library(gdata) 
library(dplyr)

for (i in 1:j) {
  dat <- read.xls(f, sheet = i) 
  dat <- dat[,1:14] # index your columns of interest
  allxlsx.files[[count]]
  count <- count + 1
}

allfiles <- do.call(bind_rows, allxlsx.files)
  • You need to assign something to `allxlsx.files[[count]]`, probably `dat`, and you can simplify your last line to `bind_rows(allxlsx.files)` – Richard Telford Oct 07 '16 at 17:02