1

This is a "let's try another way" post that is related to this one:

Is it possible to define/modify a reading function that

  • can handle the fact that data stored in an xlsx file is row-based (i.e. each row represents a variable)

  • and transforms it accordingly so it can be stored in a column-based data.frame (i.e. what used to be a row in xlsx becomes a column)

  • while capturing the underlying class/data type of the row-based variables?

Regarding csv files I would probably start with turning to readLines, but unfortunately xlsx is still a black box to me.

Here's a little xlsx file that features examples for both data orientations: https://github.com/rappster/stackoverflow/blob/master/excel/row-and-column-based-data.xlsx

Community
  • 1
  • 1
Rappster
  • 12,762
  • 7
  • 71
  • 120

3 Answers3

3

What about slightly modifying the read.xlsx function from the xlsx package:

library(xlsx)
read.transposed.xlsx <- function(file,sheetIndex) {
        df <- read.xlsx(file, sheetIndex = sheetIndex , header = FALSE)
        dft <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE) 
        names(dft) <- df[,1] 
        dft <- as.data.frame(lapply(dft,type.convert))
        return(dft)            
}

# Let's test it
read.transposed.xlsx("row-and-column-based-data.xlsx", sheetIndex = 2)
#    variable var_1 var_2 var_3
#1 2016-01-01     1     a  TRUE
#2 2016-01-02     2     b FALSE
#3 2016-01-03     3     c  TRUE
mtoto
  • 23,919
  • 4
  • 58
  • 71
  • Thanks for taking the time. The problem is reading/capturing the underlying class information of the variables when they are row-based instead of column-based: `var_1` should end up being `numeric`, `var_2` can remain `character` and `var_3` needs to be `logical` in the end. That's exactly the tricky part. – Rappster Feb 05 '16 at 23:29
  • answer updated, we can use `type.convert` from `utils` to convert character cols to appropriate classes. – mtoto Feb 05 '16 at 23:37
  • You are awesome! `type.convert` was **exactly** the missing puzzle piece! Never heard of it before, thank you so much!! – Rappster Feb 05 '16 at 23:51
  • 1
    You can simplify things a little bit more by `dft <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)` and then `dft <- as.list(dft)` instead of the call to `lapply()` – Rappster Feb 06 '16 at 00:03
  • If this approach leads to an error, I used an alternative solution - see my answer [here](https://stackoverflow.com/a/60750785/4927395) – Mark Neal Mar 19 '20 at 04:11
1
   > library(openxlsx)
> library(reshape)
> x=read.xlsx("row-and-column-based-data.xlsx",sheet = 2);
> x
  variable 2016-01-01 2016-01-02 2016-01-03
1    var_1          1          2          3
2    var_2          a          b          c
3    var_3       TRUE      FALSE       TRUE
> y=t(x)
> colnames(y)=y[1,]
> y=y[2:nrow(y),]
> cc=data.frame(y, stringsAsFactors = F)
> cc
           var_1 var_2 var_3
2016-01-01     1     a  TRUE
2016-01-02     2     b FALSE
2016-01-03     3     c  TRUE
> sapply(cc, class)
      var_1       var_2       var_3 
"character" "character" "character" 
> write.csv(cc,"temp.csv")
> bb=read.csv("temp.csv")  #infer magically types
> bb
           X var_1 var_2 var_3
1 2016-01-01     1     a  TRUE
2 2016-01-02     2     b FALSE
3 2016-01-03     3     c  TRUE
> sapply(bb, class)
        X     var_1     var_2     var_3 
 "factor" "integer"  "factor" "logical" 

or use stringsAsFactors=F if you prefer character data type:

> bb=read.csv("temp.csv", stringsAsFactors = F)  #infer magically types
> bb
           X var_1 var_2 var_3
1 2016-01-01     1     a  TRUE
2 2016-01-02     2     b FALSE
3 2016-01-03     3     c  TRUE
> sapply(bb, class)
          X       var_1       var_2       var_3 
"character"   "integer" "character"   "logical" 
arturro
  • 1,598
  • 1
  • 10
  • 13
  • Sorry, I forgot to stress the point that the underlying class/data type information needs to be captured: values `"TRUE"` and `"FALSE"` should be `logical`s, the numeric values should be `numeric`s. – Rappster Feb 05 '16 at 23:07
  • I also think you have an old version of the example file, sorry for that! I've updated it about 20 minutes ago. – Rappster Feb 05 '16 at 23:08
  • Thanks a lot for the effort! I think we found a great way based on `type.convert` - really a neat little helper function! – Rappster Feb 05 '16 at 23:52
1

You can also try this utility with this code

install.packages("remotes")
remotes::install_github("atusy/mytools")
library(mytools)
my_df <- read_excel2("my_excel_file.xlsx", sheet = 1, transposing = TRUE, error_as_NA = TRUE, rm_blank_col = TRUE)

My excel sheet had the desired column headers in the second column, which meant they ended up in the first row, which I then fixed with janitor, with the method described here

x %>%
  row_to_names(row_number = 1)
Mark Neal
  • 996
  • 16
  • 52