5

My Excel document my.xlsx has two Sheets named Sheet1 and Sheet2. I want to read all worksheets in an Excel workbook using fread function from data.table R package. The following code just read the active worksheet. Wonder how to read all worksheets without knowing their names. Thanks

df3 <- data.table::fread("in2csv my.xlsx")
> names(df3)
[1] "A" "B"
> df3
   A  B
1: 1  2
2: 2  4
3: 3  6
4: 4  8
5: 5 10
MYaseen208
  • 22,666
  • 37
  • 165
  • 309
  • 1
    Can't you use `readxl` or `openxlsx` (https://stackoverflow.com/questions/27713310/easy-way-to-export-multiple-data-frame-to-multiple-excel-worksheets) – akrun Jun 11 '19 at 14:12
  • 2
    Yes, I can use `readxl`, `rio` or `openxlsx`. But the size of my actual file is more than **300MB** that's why requires the use of `data.table::fread`. Any thoughts, please. – MYaseen208 Jun 11 '19 at 14:14
  • 1
    It does not have this functionality. I'm not sure if the fread algorithm would be helpful if adapted to do it, but you could look into it by following the instructions here https://github.com/Rdatatable/data.table/wiki/Support – Frank Jun 11 '19 at 14:17
  • [This](https://stackoverflow.com/a/52339878/9841389) might be of interest. – ismirsehregal Jun 11 '19 at 15:24
  • Save your Excel worksheets as CSV files using Excel or Open Office. Then you can trivially use fread. – Roland Jun 11 '19 at 15:33
  • @Roland: Not able to open **xlsx** file of size more than **400 MB**. – MYaseen208 Jun 11 '19 at 15:39
  • @ismirsehregal: I used the same commands. – MYaseen208 Jun 11 '19 at 15:40
  • 2
    So, how were these files created? And what genius decided that this was a good file format for the task? – Roland Jun 11 '19 at 15:44
  • 1
    Then you may have also seen [this](https://csvkit.readthedocs.io/en/1.0.3/scripts/in2csv.html)? `--write-sheets WRITE_SHEETS The names of the Excel sheets to write to files, or "-" to write all` – ismirsehregal Jun 11 '19 at 15:57

1 Answers1

6

I used openxlsx::read.xlsx the last time I needed to read many sheets from an XLSX.

#install.packages("openxlsx")
library(openxlsx)
#?openxlsx::read.xlsx

#using file chooser:
filename <- file.choose()
#or hard coded file name:
#filename <- "filename.xlsx"

#get all the sheet names from the workbook
SheetNames<-getSheetNames(filename)

# loop through each sheet in the workbook
for (i in SheetNames){

  #Read the i'th sheet
  tmp_sheet<-openxlsx::read.xlsx(filename, i)

  #if the input file exists, append the new data;; else use the first sheet to initialize the input file
  ifelse(exists("input"),
         input<-rbind(input, tmp_sheet),
         input<-tmp_sheet)
}

Note: This assumes each worksheet has identical column structure and data types. You may need to standardize\normalize the data (ex. tmp_sheet <- as.data.frame(sapply(tmp_sheet,as.character), stringsAsFactors=FALSE)), or load each sheet into it's own dataframe and pre-process further before merging.

M.Viking
  • 5,067
  • 4
  • 17
  • 33
  • 1
    Thanks @M.Viking for your answer. I wonder if this can read a workbook of size more than **300 MB**. – MYaseen208 Jun 11 '19 at 15:31
  • 1
    The author of openxlsx says ["Memory issues with Java and xlsx were the reason I wrote this package"](https://github.com/awalker89/openxlsx/issues/73#issuecomment-78234350). - Have you read the data? Did you get errors? – M.Viking Jun 11 '19 at 17:41
  • Python based command line tool [https://github.com/dilshod/xlsx2csv](https://github.com/dilshod/xlsx2csv). "Handles large XLSX files." Note the `-a` argument. – M.Viking Jun 11 '19 at 18:13
  • Thanks @M.Viking for pointing `xlsx2csv`. Would highly appreciate if you provide an example. – MYaseen208 Jun 12 '19 at 01:09
  • Check out the second example here https://linoxide.com/linux-how-to/methods-convert-xlsx-format-files-csv-linux-cli/ . This is at the command line. On Linux I ran `$ sudo pip3 install xlsx2csv`, then `$ xlsx2csv -a filename.xlsx foldername` to make a folder with that name and creates one CSV per each sheet. Alternatively `$ xlsx2csv -a filename.xlsx > filename.csv` makes a single CSV, but adds a header row (ex: `-------- sheetname`) between each sheet of data. – M.Viking Jun 12 '19 at 03:04
  • I used `import xlsx2csv` and `xlsx2csv -a my.xlsx foldername` which throws the following error: `File "", line 1 xlsx2csv -a my.xlsx foldername ^ SyntaxError: invalid syntax`. Any thoughts, please. – MYaseen208 Jun 12 '19 at 04:45
  • Sounds like you are using the Python interpreter. Run this program at the command\terminal\shell prompt. – M.Viking Jun 12 '19 at 12:24