How can I read an Excel file directly into R? Or should I first export the data to a text- or CSV file and import that file into R?
-
@Sacha Epskamp: with xlsReadWrite, you don't even need to install Perl. – Joris Meys May 23 '11 at 15:42
-
Related: http://stackoverflow.com/questions/1848331/whats-a-robust-method-in-r-for-importing-from-and-exporting-data-to-excel – Joris Meys May 23 '11 at 15:48
-
Not tested but there are also [xlsx](http://cran.r-project.org/package=xlsx) (based on Java) and [WriteXLS](http://cran.r-project.org/package=WriteXLS) (based on Perl) packages. – Marek May 24 '11 at 08:19
-
`gdata` version 2.8.2 reads `xlsx` files with the `read.xls` function – Ben May 05 '12 at 05:47
-
See my warning (as an answer below) regarding the possible loss of precision if you export the data to a text format from Excel. – russellpierce Oct 08 '14 at 20:08
-
`xlsx` package for xlsx/xlsm/xls, don't know about xlam etc. – Qbik Feb 05 '15 at 19:51
-
2"I never see a reason not to export to a text file first". How about this: When I export to CSV, one of the fields I need doesn't get written. It seems like some kind of DRM but since I didn't write the spreadsheet I don't know. – Nate Reed Dec 22 '15 at 22:28
12 Answers
Let me reiterate what @Chase recommended: Use XLConnect.
The reasons for using XLConnect are, in my opinion:
- Cross platform. XLConnect is written in Java and thus will run on Windows, Linux, and Mac with no change to your R code (except possibly path strings).
- Nothing else to load. Just install XLConnect and get on with life.
- You only mentioned reading Excel files, but XLConnect will also write Excel files, including changing cell formatting. And it will do this from Linux or Mac, not just Windows.
XLConnect is somewhat new compared to other solutions so it is less frequently mentioned in blog posts and reference docs. For me it's been very useful.
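For illustration, a minimal sketch of reading and writing with XLConnect (the file and sheet names here are hypothetical):
library(XLConnect)

# read one worksheet straight into a data.frame
df <- readWorksheetFromFile("report.xlsx", sheet = 1)

# write a data.frame back out, creating the workbook and sheet if needed
wb <- loadWorkbook("results.xlsx", create = TRUE)
createSheet(wb, name = "results")
writeWorksheet(wb, df, sheet = "results")
saveWorkbook(wb)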

And now there is readxl:
The readxl package makes it easy to get data out of Excel and into R. Compared to the existing packages (e.g. gdata, xlsx, xlsReadWrite etc) readxl has no external dependencies so it's easy to install and use on all operating systems. It is designed to work with tabular data stored in a single sheet.
readxl is built on top of the libxls C library, which abstracts away many of the complexities of the underlying binary format.
It supports both the legacy .xls format and the newer .xlsx format.
readxl is available from CRAN, or you can install it from github with:
# install.packages("devtools")
devtools::install_github("hadley/readxl")
Usage
library(readxl)
# read_excel reads both xls and xlsx files
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")
# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")
Note that while the description says 'no external dependencies', it does require the Rcpp package, which in turn requires Rtools (on Windows) or Xcode (on OS X) when built from source; those are dependencies external to R, though many people already have them installed for other reasons.

-
Much faster than xlsx. Read times are like read.xlsx2, but it infers types. – Steve Rowe May 19 '15 at 18:45
-
@SteveRowe see new answer for some (attempted) objective benchmarks confirming this – MichaelChirico Jul 30 '15 at 21:18
-
Is there a way to read strings as factors using `read_excel`? I like the speed compared to `read.xlsx`, but having to then manually convert columns from characters to factors defeats the purpose. – coip Aug 20 '15 at 21:15
-
+1 for the fact that it has no dependencies. I hate to have to install Java. And I have tried it and it works very well for me. – Bastian Sep 08 '15 at 06:43
-
readxl and openxlsx are the best. readxl is faster but it doesn't allow writing. Anyway, neither of them works well when trying to specify column classes/types. – skan Oct 15 '15 at 17:55
-
@coip why do you want to do that? see [here](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) – MichaelChirico Sep 05 '16 at 13:45
-
@MichaelChirico Because factors are more useful for the analyses I want to do. This is a decent explanation: http://stackoverflow.com/a/3445486/4342901 – coip Apr 26 '17 at 21:13
-
@coip forgive me for being curt, as in fact I do use factors often. However, whenever I use factors, it's because I need non-alphabetical ordering, so I have to do some manual tinkering, which means I don't have any problem with converting those columns by hand after read-in. And I hardly think having to do so is 1) slow or 2) "defeats the purpose" – MichaelChirico Apr 26 '17 at 21:43
Yes. See the relevant page on the R wiki. Short answer: read.xls from the gdata package works most of the time (although you need to have Perl installed on your system -- usually already true on MacOS and Linux, but it takes an extra step on Windows; see http://strawberryperl.com/). There are various caveats and alternatives listed on the R wiki page.
The only reason I see not to do this directly is that you may want to examine the spreadsheet to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., not comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc.) I generally have no problem with this process.
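For instance, a minimal sketch with gdata (the file name is hypothetical; on Windows you may need to point the perl argument at your Perl executable):
library(gdata)
df <- read.xls("spreadsheet.xls", sheet = 1)
# if Perl is not on your PATH (common on Windows), say where it lives, e.g.:
# df <- read.xls("spreadsheet.xls", sheet = 1, perl = "C:/Strawberry/perl/bin/perl.exe")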

-
There are a lot of potential problems to consider that I've run into personally. Fields with numbers containing commas need to be stripped and converted to numeric in R. Fields with "-" need to be recoded to NA. The overall recommendation is to really look at your numbers in Excel and ensure that they are being translated correctly into R. – Brandon Bertelsen May 23 '11 at 17:06
-
Can't argue with "you really need to look at your numbers" ... what is the issue with "-" fields? does `na.strings="-"` address the problem? How many of these issues are generic and how many of them (e.g. numeric fields with commas) can be addressed with other tools such as XLConnect ...? – Ben Bolker May 23 '11 at 18:26
-
That comment was directed to the OP, not at you Ben, my fault for bad placement. – Brandon Bertelsen May 23 '11 at 18:39
-
Relevant anecdote: `read.xls` succeeded in reading a very large sheet from an Excel file where both `XLConnect` and `xlsx` failed (I believe because they both rely on [Apache POI](http://poi.apache.org/)) – Matt Parker Jun 23 '14 at 23:50
EDIT 2015-October: As others have commented here, the openxlsx and readxl packages are by far faster than the xlsx package and actually manage to open larger Excel files (>1500 rows and >120 columns). @MichaelChirico demonstrates that readxl is better when speed is preferred, while openxlsx replaces the functionality provided by the xlsx package. If you are looking for a package to read, write, and modify Excel files in 2015, pick openxlsx instead of xlsx.
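As a rough sketch of the openxlsx equivalents of the old xlsx workflow (the file names here are hypothetical):
library(openxlsx)
df <- read.xlsx("data.xlsx", sheet = 1)  # read one sheet into a data.frame
write.xlsx(df, "data-out.xlsx")          # write a data.frame to a new .xlsx file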
Pre-2015: I have used the xlsx package. It changed my workflow with Excel and R. No more annoying pop-ups asking if I am sure that I want to save my Excel sheet in .txt format. The package also writes Excel files.
However, I find the read.xlsx function slow when opening large Excel files. The read.xlsx2 function is considerably faster, but does not guess the vector class of data.frame columns. You have to use the colClasses argument to specify the desired column classes if you use the read.xlsx2 function. Here is a practical example:
read.xlsx("filename.xlsx", 1)
reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets. Works also for .xls
files.
read.xlsx2("filename.xlsx", 1) is faster, but you will have to define the column classes manually. A shortcut is to run the command twice (see the example below). The "character" specification converts your columns to factors; use "Date" and "POSIXct" for times.
coln <- function(x) {  # a helper to see column numbers
  y <- rbind(seq(1, ncol(x)))
  colnames(y) <- colnames(x)
  rownames(y) <- "col.number"
  return(y)
}

data <- read.xlsx2("filename.xlsx", 1)  # open the file
coln(data)  # check the column numbers you want to have as factors
x <- 3      # say you want columns 1-3 as factors, the rest numeric
data <- read.xlsx2("filename.xlsx", 1,
                   colClasses = c(rep("character", x), rep("numeric", ncol(data) - x)))

Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform the best (in a few simple situations).
I myself have been using xlsx since I started using R, out of inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.
Any benchmarking exercise is fraught with difficulties, as some packages are sure to handle certain situations better than others, and there is a waterfall of other caveats.
That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):
set.seed(51423)
NN <- 1000L  # number of rows; rerun with NN = 25000L for the larger benchmark below
DF <- data.frame(  # assigned so it can be written to csv afterwards
str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
str2 = sample(sprintf("%09d", 1:NN)), #ID field 2
#varying length string field--think names/addresses, etc.
str3 =
replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
collapse = "")),
#factor-like string field with 50 "levels"
str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
#factor-like string field with 17 levels, varying length
str5 =
sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
collapse = "")), NN, TRUE),
#lognormally distributed numeric
num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
#3 binary strings
str6 = sample(c("Y","N"), NN, TRUE),
str7 = sample(c("M","F"), NN, TRUE),
str8 = sample(c("B","W"), NN, TRUE),
#right-skewed integer
int1 = ceiling(rexp(NN)),
#dates by month
dat1 =
sample(seq(from = as.Date("2005-12-31"),
to = as.Date("2015-12-31"), by = "month"),
NN, TRUE),
dat2 =
sample(seq(from = as.Date("2005-12-31"),
to = as.Date("2015-12-31"), by = "month"),
NN, TRUE),
num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
#date by day
dat3 =
sample(seq(from = as.Date("2015-06-01"),
to = as.Date("2015-07-15"), by = "day"),
NN, TRUE),
#lognormal numeric that can be positive or negative
num3 =
(-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
)
I then wrote this to csv, opened it in LibreOffice and saved it as an .xlsx file, then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried a version where I specified column types, but this didn't change the rankings).
I'm excluding RODBC because I'm on Linux; XLConnect because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so putting its horse in the race on only its reading capabilities seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (it seems to have been phased out).
I then ran benchmarks with NN=1000L and NN=25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. gc is there primarily for xlsx, which I've found can at times create memory clogs. Without further ado, here are the results I found:
1,000-Row Excel File
library(microbenchmark)
# fl is the path to the .xlsx file created above, e.g. (hypothetical name):
# fl <- "excel_benchmark.xlsx"
benchmark1k <-
  microbenchmark(times = 100L,
                 xlsx = {xlsx::read.xlsx2(fl, sheetIndex = 1); invisible(gc())},
                 openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                 readxl = {readxl::read_excel(fl); invisible(gc())},
                 gdata = {gdata::read.xls(fl); invisible(gc())})
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 194.1958 199.2662 214.1512 201.9063 212.7563 354.0327 100
# openxlsx 142.2074 142.9028 151.9127 143.7239 148.0940 255.0124 100
# readxl 122.0238 122.8448 132.4021 123.6964 130.2881 214.5138 100
# gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345 100
So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum:
# expr min lq mean median uq max
# 1 xlsx 1.59 1.62 1.62 1.63 1.63 1.65
# 2 openxlsx 1.17 1.16 1.15 1.16 1.14 1.19
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 16.43 16.62 15.77 16.67 16.25 11.31
We see that my own favorite, xlsx, is about 60% slower than readxl.
25,000-Row Excel File
Due to the amount of time it takes, I only did 20 repetitions on the larger file, otherwise the commands were identical. Here's the raw data:
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 4451.9553 4539.4599 4738.6366 4762.1768 4941.2331 5091.0057 20
# openxlsx 962.1579 981.0613 988.5006 986.1091 992.6017 1040.4158 20
# readxl 341.0006 344.8904 347.0779 346.4518 348.9273 360.1808 20
# gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826 20
Here's the relative data:
# expr min lq mean median uq max
# 1 xlsx 13.06 13.16 13.65 13.75 14.16 14.13
# 2 openxlsx 2.82 2.84 2.85 2.85 2.84 2.89
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 128.62 128.67 129.22 129.86 129.69 126.75
So readxl is the clear winner when it comes to speed. gdata better have something else going for it, as it's painfully slow at reading Excel files, and this problem is only exacerbated for larger tables.
Two draws of openxlsx are 1) its extensive other methods (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and 2) (more of a drawback for readxl) the col_types argument in readxl only (as of this writing) accepts some nonstandard R: "text" instead of "character" and "date" instead of "Date".
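To illustrate that last point, a minimal sketch of passing those names to read_excel (the file name and column layout are hypothetical; col_types needs one entry per column):
library(readxl)
# note "text" and "date" rather than "character" and "Date"
df <- read_excel("my-spreadsheet.xlsx",
                 col_types = c("text", "numeric", "date"))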

-
It would be great if you also added a benchmark for XLConnect. Also note that readxl is not able to write. xlsx and openxlsx don't work properly with the col_types or colClasses option. – skan Sep 15 '15 at 10:36
-
@skan I initially ran some tests with `XLConnect` but it is very slow; I believe `readxl`'s drawbacks were sufficiently covered in my final paragraph; and I have no similar experience to yours with `xlsx` or `openxlsx` as I regularly use both to specify types. – MichaelChirico Sep 15 '15 at 13:21
I've had good luck with XLConnect: http://cran.r-project.org/web/packages/XLConnect/index.html

-
In other SE networks this answer would be closed for being link-only. – luchonacho Dec 12 '16 at 08:45
library(RODBC)
file.name <- "file.xls"
sheet.name <- "Sheet Name"
## Connect to Excel File Pull and Format Data
excel.connect <- odbcConnectExcel(file.name)
dat <- sqlFetch(excel.connect, sheet.name, na.strings=c("","-"))
odbcClose(excel.connect)
Personally, I like RODBC and can recommend it.

-
Caveat: ODBC can sometimes be tricky to get running on platforms other than Windows. – JD Long May 23 '11 at 16:52
-
@JD Long and even on Windows it's a PITA. No sexy time for me and ODBC on 64-bit W7... – Roman Luštrik Aug 15 '11 at 07:38
-
Loading required package: RODBC Error in odbcConnectExcel(file.name) : odbcConnectExcel is only usable with 32-bit Windows – andrekos Oct 07 '13 at 07:20
Another solution is the xlsReadWrite package, which doesn't require additional installs but does require that you download the additional shlib before you use it for the first time, by running:
require(xlsReadWrite)
xls.getshlib()
Forgetting this can cause utter frustration. Been there and all that...
On a sidenote: you might want to consider converting to a text-based format (e.g. csv) and reading in from there, for a number of reasons (a short sketch of the csv route follows this list):
- Whatever your solution (RODBC, gdata, xlsReadWrite), some strange things can happen when your data gets converted. Especially dates can be rather cumbersome. The HFWutils package has some tools to deal with Excel dates (per @Ben Bolker's comment).
- If you have large sheets, reading in text files is faster than reading in from Excel.
- For .xls and .xlsx files, different solutions might be necessary. E.g. the xlsReadWrite package currently does not support .xlsx, AFAIK. gdata requires you to install additional Perl libraries for .xlsx support, while the xlsx package can handle the extension of the same name.
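If you do take the csv route, here is a minimal sketch of reading the exported file back in with explicit column classes (the file name and column layout are hypothetical):
# assumes the sheet was exported from Excel as "export.csv" with a character ID
# column, a numeric column and an ISO-formatted (yyyy-mm-dd) date column, in that order
dat <- read.csv("export.csv",
                colClasses = c("character", "numeric", "Date"),
                stringsAsFactors = FALSE)
str(dat)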

-
@Ben Thx for the tip, I'll include it in my answer. I didn't try to be complete though, as the wiki page the accepted answer links to is already rather complete. But it doesn't mention the HFWutils package. – Joris Meys May 23 '11 at 21:00
-
-1; See my answer. TL;DR: Excel does not save a full-precision dataset to csv (or the clipboard). Only the visible values are retained. – russellpierce Oct 08 '14 at 20:03
As noted above in many of the other answers, there are many good packages that connect to the XLS/X file and get the data in a reasonable way. However, you should be warned that under no circumstances should you use the clipboard (or a .csv file) to retrieve data from Excel. To see why, enter =1/3 into a cell in Excel. Now, reduce the number of decimal places visible to you to two. Then copy and paste the data into R. Now save the CSV. You'll notice that in both cases Excel has helpfully kept only the data that was visible to you through the interface, and you've lost all of the precision in your actual source data.
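One way to check whether an existing export has already been affected is to compare the workbook values against the CSV (the file names here are hypothetical; this sketch assumes the readxl package, which reads the stored values rather than the displayed ones):
x_xlsx <- readxl::read_excel("source.xlsx")  # full-precision values stored in the workbook
x_csv  <- read.csv("source.csv")             # only what Excel displayed when the csv was saved
all.equal(x_xlsx[[1]], x_csv[[1]])           # a mismatch reveals rounding introduced by the export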

-
I'd love to know who thought numeric fidelity wasn't relevant/important. – russellpierce Aug 10 '15 at 06:40
-
Good warning. However it depends on the package you use. Some are not falling into this trap. – RockScience Dec 22 '16 at 08:55
-
@RockScience This is a fundamental design choice in the CSV export by Excel. Do you have an example of a package that uses the clipboard that doesn't have this issue? Packages directly parsing the xls/xlsx file would not be prone to fall into this trap. Thus the specific scoping of my warning. – russellpierce Dec 22 '16 at 16:43
-
on Unix you can try `gnumeric::read.gnumeric.sheet`. On Windows I am not 100% sure but I think `gdata::read.xls` should also work well (need to install perl though) – RockScience Dec 23 '16 at 02:07
-
@RockScience Respectfully, gdata::read.xls doesn't operate on clipboard data (unless you go out of your way) and gnumeric isn't Excel. So, absent extraordinary proof, my warning stands. There are so many other good options in response to this question - let's hope that people use those instead. – russellpierce Dec 23 '16 at 02:14
Expanding on the answer provided by @Mikko, you can use a neat trick to speed things up without having to "know" your column classes ahead of time. Simply use read.xlsx to grab a limited number of records to determine the classes, and then follow it up with read.xlsx2 (both functions are from the xlsx package).
Example
library(xlsx)  # read.xlsx and read.xlsx2 come from the xlsx package
# just the first 50 rows should do...
df.temp <- read.xlsx("filename.xlsx", 1, startRow = 1, endRow = 50)
df.real <- read.xlsx2("filename.xlsx", 1,
                      colClasses = as.vector(sapply(df.temp, mode)))

-
Your solution returns `numeric` for `factors` on my computer. `read.xlsx` uses `character` in `readColumns` function to specify factors. I am sure there is a more elegant way of getting factors as characters, but here is a modified version of your function that works: `df.real <- read.xlsx2("filename.xlsx", 1, colClasses=gsub("factor", "character", as.vector(sapply(df.temp, class))))`. – Mikko Jul 15 '15 at 11:42
-
This only works to the extent that column type can be adequately determined by the first row. Usually one parses more than the first row to make that determination. The answer could be improved to address aaa90210's comment by naming the package these functions come from. – russellpierce Jul 17 '17 at 11:51
If the file is actually tab-delimited text that merely carries an .xls extension (as some systems export), it can be read directly with read.table; note that this will not work for a genuine binary Excel file:
my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header = TRUE)
Reading .xls and .xlsx files using the readxl package:
library("readxl")
my_data <- read_excel("xxxxx.xls")
my_data <- read_excel("xxxxx.xlsx")
