3

I have data in .xlsx files that have a header structure spread over two rows:

rowid   CATA      CATB    CATC
        A1 A2 A3  B1 B2 B3  C1 C2 
1       1  1  2   2  3  5   5  6 
...

Furthermore, the number of columns (CATA CATB etc) in the first header can change across files and also the number of columns in the second header colA1 ... colC2).

In excel the first header is indicated with merged cells, delimiting the range of columns in the second header.

I have about hundred files so I would like to have an algorithm (no manual work) for getting the data structured as:

Rowid Cat  Col val 
1     CATA A1  1 
1     CATA A2  1 
1     CATA A3  2
1     CATB B1  2 
1     CATB B2  3 
1     CATB B3  5 
1     CATC C1  5 
1     CATC C2  6

What is the best way to do this in R?

Florian
  • 24,425
  • 4
  • 49
  • 80
Rense
  • 91
  • 5
  • 1
    SO isn't exactly a coding service. What have you tried so far? Where do you get stuck? A reproducible example will go a long way, too. – Roman Luštrik Jan 14 '18 at 09:30
  • I don't understand what you mean with the words "coding service". The data is from a realworld case (much more elaborate however). I'm stuck at handling the merged cell range in R. – Rense Jan 14 '18 at 09:42
  • In a nutshell: 1) remove the first line ("Cat"), 2) `melt` your data and finally 3) add "Cat" column using a `join` (or `merge` ) function. – DJack Jan 14 '18 at 09:48

1 Answers1

4

Here is a possible solution. We read the xlsx file without headers, fill the missing values in the first row with na.locf from zoo and create a new header that exists of the top two rows combined, i.e. CATA---A1, CATA---A2, CATB---B1, etc. We then use melt to reshape this dataframe into long format, and use separate to split our custom header back in Cat and Col.

I hope this helps!


test.xlsx

enter image description here


library(xlsx)
library(zoo)
library(reshape2)
library(tidyr)

read_my_xlsx <- function(xlsx_name,sheet_id)
{
  my_df <- xlsx::read.xlsx(xlsx_name,sheetIndex=sheet_id,header=F,colClasses='character',stringsAsFactors=FALSE)
  my_df[1,] = na.locf(as.character(unlist(my_df[1,])))
  my_df[1,] = c(my_df[1,1] ,sapply(2:ncol(my_df),function(x) paste0(my_df[1,x],'---',my_df[2,x])))
  colnames(my_df) = my_df[1,]
  my_df = my_df[-c(1,2),]
  my_df = melt(my_df, id.vars=c("rowid"))
  my_df = separate(my_df,variable, c("Cat", "Col"), "---")
  return(my_df)
}

read_my_xlsx('test.xlsx',1)

Output:

   rowid  Cat Col value
1      1 CATA  A1     1
2      2 CATA  A1     1
3      3 CATA  A1     3
4      1 CATA  A2     1
5      2 CATA  A2     4
6      3 CATA  A2     3
7      1 CATB  B1     4
8      2 CATB  B1     2
9      3 CATB  B1     1
10     1 CATB  B2     1
11     2 CATB  B2     1
12     3 CATB  B2     4
13     1 CATB  B3     1
14     2 CATB  B3     2
15     3 CATB  B3     3
16     1 CATC  C1     4
17     2 CATC  C1     2
18     3 CATC  C1     1
19     1 CATC  C2     1
20     2 CATC  C2     2
21     3 CATC  C2     3
Florian
  • 24,425
  • 4
  • 49
  • 80
  • if you are using `tidyr` anyway, you could also use `gather` instead of `melt` ;-) – Jaap Jan 14 '18 at 09:56
  • 1
    Thanks @Jaap, that would be nicer indeed, I have never used that though... There was a point in time I kept coming back to your answer [here](https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format), and in the end I guess I just decided to always use the first one, haha. – Florian Jan 14 '18 at 10:12
  • This looks very promising, many thanks! Except the part of rowid is not working properly. – Rense Jan 14 '18 at 10:45
  • You can find real world data at https://www.dropbox.com/s/u9ogvw1tn2en3bc/test.xlsx?dl=0. The rowid should be the first column in the spreadsheet. – Rense Jan 14 '18 at 10:48
  • 1
    but that problem is easily solved by omitting the line my_df$value = as.numeric(my_df$value). Thank you've saved me hours of trying. – Rense Jan 14 '18 at 11:19
  • No problem, glad I could help. And great that you were able to resolve the last issue yourself. Also, please note that you should replace the separator `-` with something that does not occur in your data, e.g. `---`. I updated the answer to reflect that. – Florian Jan 14 '18 at 11:46