
I am still suffering every time I deal with Excel files in R.

What is the best way to do the following?

1- Import an Excel file into R as a "whole workbook" and be able to do analysis on any sheet in the workbook? If you think about using XLConnect, please bear in mind the "out of memory" problem with Java. I have files of over 30 MB, and dealing with the Java memory problem every time consumes more time (running -Xmx does not work for me).

2- Not miss any data from any Excel sheet? Saving the file as CSV reports that some sheets are "out of range", i.e. beyond 65,536 rows and 256 columns. It also cannot deal with some formulas.

3- Not have to import each sheet separately? Importing sheets into SPSS, Stata or EViews, saving them in those programs' formats, and then working with the output file in R works fine most of the time. However, this method has two major problems: one is that you have to have the software installed on the machine, and the other is that it imports only one sheet at a time. If I have over 30 sheets, it becomes very time consuming.

This might be an ongoing question that has been answered many, many times; however, each answer solves a part of the problem rather than the whole issue. It is like putting out the fire rather than strategically solving the problem.

I am on Mac OS X 10.10 with R 3.1.1.

mallet
    Clearly the strategic solution is to stop using Excel altogether. Go on, you know you want to. – Andrie Oct 28 '14 at 14:41
  • As the answer depends on many variables you will need to try out the various possibilities. Here is a list: https://web.archive.org/web/20131109195709/http://rwiki.sciviews.org/doku.php?id=tips:data-io:ms_windows&s=excel – G. Grothendieck Oct 28 '14 at 14:43
  • @Andrie I wish I could. The problem is that I am using some databases and the data set is only available in Excel. I wish they could see how much people suffer from Excel when analysing their data, and would provide an alternative way. – mallet Oct 28 '14 at 14:45
  • Are you using `.xlsx` files? Or the old `.xls` files? – Barranka Oct 28 '14 at 14:55
  • Man... I feel your pain. Excel should be used as the last step in an analysis process, but people think it's a damned database. If you have a problem getting data from someone, and they finally send it in an Excel file, now you have two problems. – JNevill Oct 28 '14 at 14:56
  • Maybe you should check the [`xlsx` package](http://cran.r-project.org/web/packages/xlsx/index.html) – Barranka Oct 28 '14 at 14:57
  • @Barranka I am now using `.xlsx`, but as far as I remember I have experienced the same problems with `.xls` files – mallet Oct 28 '14 at 14:59
  • @JNevill I could accept an Excel file developed by an individual; however, I do not really understand why big firms that develop databases insist on giving out their data this way. They know that you are going to do analysis and most probably will not do it in Excel itself. – mallet Oct 28 '14 at 15:08
  • @Barranka it has the same issue with Java memory – mallet Oct 28 '14 at 15:12
  • @AhmedSalhin 30 MB is not a lot... I think R can handle it (but after importing the data you should close or remove the objects). I think that you can use the `xlsx` package utilities to read and import the data, and then use the [`ff` package](http://cran.r-project.org/web/packages/ff/index.html) to handle the memory issues (`ff` allows you to create objects to disk that behave "almost" as if they were in RAM). Also, be sure to use the 64 bit build if you're handling big objects / data sets. – Barranka Oct 28 '14 at 15:23
  • I have tried a few packages to open an Excel file; openxlsx is definitely the best route. It is way faster and more stable than the others. The function is: openxlsx::read.xlsx. My advice is to read the whole sheet and then play with the data within R, rather than reading parts of the sheet several times. I have used it a lot to open large Excel files (8000+ columns, 1000+ rows) and it has always worked well. I use the xlsx package to write to Excel, but it had numerous memory issues when reading (that's why I moved to openxlsx) – Romain Oct 28 '14 at 15:53
  • @Romain I have just tried `openxlsx::read.xlsx` and it worked fine. Hope it is stable for all Excel files. The one I have tried now has 7,888 rows and 256 columns, which is a medium-sized file in my data set. I will try it with a much larger one and see. – mallet Oct 28 '14 at 22:03
  • I never had any issues with it so far. I hope it will fit your needs as well – Romain Oct 28 '14 at 22:07

2 Answers


I have tried a few packages for opening Excel files, and openxlsx is definitely the best route. It is way faster and more stable than the others. The function is: openxlsx::read.xlsx. My advice is to read the whole sheet in one go and then play with the data within R, rather than reading parts of the sheet several times. I have used it a lot to open large Excel files (8,000+ columns by 1,000+ rows) and it has always worked well. I use the xlsx package to write to Excel, but it had numerous memory issues when reading (that's why I moved to openxlsx).
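
For example, a rough sketch of reading a whole workbook into a named list of data frames with openxlsx (the file and sheet names below are placeholders, not from your data):

library(openxlsx)

file <- "my_workbook.xlsx"            # placeholder path: use your own file

# Read every sheet of the workbook into a named list of data frames
sheet_names <- getSheetNames(file)
workbook <- lapply(sheet_names, function(s) read.xlsx(file, sheet = s))
names(workbook) <- sheet_names

# Analyse any sheet by name, e.g. (assuming a sheet called "Sheet1" exists)
head(workbook[["Sheet1"]])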

Add-in: on a side note, if you want to use R with Excel, you sometimes need to execute VBA code from R. I found the procedure quite difficult to achieve. I fully documented the proper way of doing it in a previous question on Stack Overflow: Apply VBA from R

Romain

Consider using the xlsx package. It has methods for dealing with Excel files and worksheets. Your question is quite broad, but I think this can serve as an example:

library(xlsx)

# Load the whole workbook and pull out its sheets
wb <- loadWorkbook('r_test.xlsx')
sheets <- getSheets(wb)
sheet <- sheets[[1]]                      # first sheet of the workbook

# Read a rectangular block of cells (including the header row) into a data frame
df <- readColumns(sheet,
                  startColumn = 1, endColumn = 3,
                  startRow = 1, endRow = 6)
df
##  id name x_value
##1  1    A      10
##2  2    B      15
##3  3    C      20
##4  4    D      13
##5  5    E      17
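
If you want the whole workbook rather than a single sheet, a rough sketch along these lines could loop over every sheet (reusing the same placeholder file):

library(xlsx)

file <- 'r_test.xlsx'                     # placeholder file name
wb <- loadWorkbook(file)
sheet_names <- names(getSheets(wb))

# Read each sheet into a named list of data frames
all_sheets <- lapply(seq_along(sheet_names),
                     function(i) read.xlsx(file, sheetIndex = i))
names(all_sheets) <- sheet_names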

As for the memory issue, I think you should check the ff package:

The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory.
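
For instance, a rough sketch (assuming df is a data frame you have already imported from a sheet; note that ff stores character columns only as factors):

library(ff)

# Move the imported data frame into an ffdf object backed by files on disk
df_ff <- as.ffdf(df)

dim(df_ff)       # behaves much like a regular data frame
df_ff[1:5, ]     # rows are paged in from disk as needed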


Another option (but it may be overkill) would be to load the data into a real database and deal with database connections. If you are dealing with really big datasets, a database may be the best approach.

Some options would be:

  • The RSQLite package

    If you can load your data into an SQLite database, you can use this package to connect to that database and handle the data directly. That "splits" the workload between R and the database engine. SQLite is quite easy to use and (almost) "config free", and each SQLite database is stored in a single file (a short sketch of this approach follows below).

  • The RMySQL package

    Even better than the above option: MySQL is great for storing large datasets. However, you'll need to install and configure a MySQL server on your computer.

Remember: if you work with R and a database, delegate as much of the heavy workload as possible to the database (e.g. data filtering, aggregation, etc.), and use R to get the final results.
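
For example, a rough sketch of the SQLite route, assuming df is a data frame already imported from one of the sheets (the database file and table names are placeholders):

library(DBI)
library(RSQLite)

# Open (or create) an SQLite database stored in a single file
con <- dbConnect(RSQLite::SQLite(), "excel_data.sqlite")

# Push an imported sheet into the database (placeholder table name)
dbWriteTable(con, "sheet1", df, overwrite = TRUE)

# Delegate the heavy lifting (filtering, aggregation) to the database
res <- dbGetQuery(con, "SELECT name, AVG(x_value) AS avg_x
                        FROM sheet1
                        GROUP BY name")

dbDisconnect(con)

RMySQL follows the same DBI pattern; essentially only the dbConnect() call changes.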

Barranka
  • `xlsx` runs into the same memory problem as `XLConnect`. Using the `ff` package or any other memory workaround overcomplicates the issue. – mallet Oct 28 '14 at 21:58
  • @AhmedSalhin then maybe the database solution is a better approach... of course you'll need to write a program to send the data from excel to the database... but considering the memory problems you're facing, it may be a good option – Barranka Oct 28 '14 at 22:44