19

I received hundreds of Excel sheets containing merged cells. Sender insists on using Excel and merging cells - nothing I can do about that. How do I read these using R? For example, a simplified version of the problem area of the input sheet might look something like this, where the merged cells (B2,B3,C2,C3) contain the word "X". The number of merged cells and their location in the sheet (and the value of "X") changes from sheet to sheet, and there may be more than one set of merged cells in the same sheet. The sheets are not actually in tabular format, and they contain other empty cells. I have successfully looped through all the files, cleaned up the whole mess, reshaped the result and obtained a tidy dataset (1 sheet instead of 736 Excel workbooks). The problem is, my solution so far ignores the information in the merged cells.

    A   B   C   D
1   a   f   i   l
2   b   X       m
3   c           n
4   d   g   j   o
5   e   h   k   p

How can I read the Excel sheet into R so that the result looks like this, with the word "X"

    A   B   C   D
1   a   f   i   l
2   b   X   X   m
3   c   X   X   n
4   d   g   j   o
5   e   h   k   p
Dave Stumped
  • 319
  • 1
  • 3
  • 12
  • May be [this](http://yihui.name/en/2009/09/how-to-import-ms-excel-data-into-r/) would be helpful – akrun May 29 '16 at 12:30
  • Excel VBA is best at munging Excel data. You could write a VBA macro to unmerge the cells, duplicating the contents, saving the files as csv files, run the files through that macro and then open them in R. – John Coleman May 29 '16 at 13:03
  • This work-in-progress is relevant here: https://github.com/rsheets/rexcel/blob/master/README.md – Ben May 29 '16 at 13:19
  • How can you be sure that the merged area is B2:C3 and not two merged areas: B2:C2 and A3:C3 ? This could be very difficult to detect automatically. – R. Schifini May 29 '16 at 14:31

5 Answers5

28
library(openxlsx)

data <- read.xlsx(xlsxFile = "Your path", fillMergedCells = TRUE, colNames = FALSE)

fillMergedCells = TRUE

Try this!

Megatron
  • 15,909
  • 12
  • 89
  • 97
MaazKhan47
  • 811
  • 1
  • 11
  • 22
10

If a VBA/R hybrid suits your purposes, here is a VBA macro which will unmerge all cells in a worksheet, while simultaneously filling all cells in the unmerged region with the corresponding value:

Sub UnMerge(ws As Worksheet)
    Dim R As Range, c As Range
    Dim v As Variant
    For Each c In ws.UsedRange
        If c.MergeCells Then
            v = c.Value
            Set R = c.MergeArea
            R.UnMerge
            R.Value = v
        End If
    Next c
End Sub

A simple test to show how it is called:

Sub test()
    UnMerge Sheets(1)
End Sub

The sub UnMerged can be used as part of a larger program that e.g. iterates over all .xlsx files in a folder and all data-containing sheets in the files, unmerging them all and saving them as .csv files.

On Edit. Native VBA file handling is somewhat annoying. I tend to use the related scripting language VBScript if I need to iterate over multiple files. I'm not sure if your virtual Windows can handle VBScript. I would assume so since VBScript is a standard part of the Windows OS. If this is the case, see if the following works (after backing up the files just to be safe). Save the code as a simple text file with a .vbs extension in the folder that contains the Excel files that you want to modify. Then, simply click its icon. It will iterate over all .xlx and .xlsx files in the directory that contains the script and unmerge sheet 1 in each such file. I didn't test it extensively and it contains no error-handling, but I did test it on a folder with three Excel files which each contained multiple merged regions and it ran as expected on my Windows machine. I don't know if it will work on your Mac:

Option Explicit

Dim fso,fol,f,xl, wb, ws,ext,v,r,c

Set fso = WScript.CreateObject("Scripting.FileSystemObject")
Set xl = CreateObject("Excel.Application")
xl.DisplayAlerts = False
xl.ScreenUpdating = False
set fol = fso.GetFolder(fso.GetParentFolderName(WScript.ScriptFullName))

For Each f In fol.Files
    ext = LCase(fso.GetExtensionName(f.Name))
    If ext = "xls" Or ext = "xlsx" Then
        Set wb = xl.Workbooks.Open(f.Path)
        Set ws = wb.Sheets(1)
        For Each c In ws.UsedRange
            If c.MergeCells Then
                v = c.Value
                Set R = c.MergeArea
                R.UnMerge
                R.Value = v
            End If
        Next
        wb.Save
        wb.Close   
    End If
Next
John Coleman
  • 51,337
  • 7
  • 54
  • 119
  • I'm not at all familiar with VBA. How do I loop through all files in a folder to do this? – Dave Stumped May 29 '16 at 14:03
  • @DaveStumped Is there 1 worksheet per file or do some of the files have multiple worksheets? Also -- are you on a Windows machine? (It is harder to process Excel sheets in something like Linux or even Macs despite the existence of Mac versions of Excel). – John Coleman May 29 '16 at 14:12
  • At the moment, there are multiple sheets in each workbook. However, so far at least, there is only one sheet in each workbook that I'm interested in (that mercifully all have the same name in each workbook), and I've successfully moved all those to sheet 1 in each workbook that is saved on disk, all of them in the same folder. I'm working on a MAC, but I can run a virtual Windows machine that has Excel 2010 (64 bit). – Dave Stumped May 29 '16 at 14:26
  • @DaveStumped Glad it works. I threw in the line `xl.Quit` at the bottom of the script to be safe. Otherwise you might have an instance of Excel running in the background after the script is done. – John Coleman May 29 '16 at 17:12
  • 1
    I created a followup question for the next step here http://stackoverflow.com/questions/37512950/r-find-metadata-at-top-of-excel-spreadsheet-and-add-to-columns-in-table – Dave Stumped May 29 '16 at 17:41
3

This solution assumes that there is only one merged area on a spreadsheet, and that the only missing values (NAs) are due to the merged cells.

Code:

library("openxlsx")
data = read.xlsx(xlsxFile = "Book1.xlsx", colNames = F)

cl = min(ceiling(which(is.na(data))/dim(data)[1]))
rw = min(which(is.na(data))%%dim(data)[1])

data[is.na(data)] = data[rw,cl]

Example:

Data read from an excel with merged cells:

   X1   X2   X3   X4  X5
1   1    a    q    a  11
2   2    b    w    s  22
3   3    c    e    d  33
4   4    d <NA> <NA>  44
5   5 <NA> <NA> <NA>  55
6   6 <NA> <NA> <NA>  66
7   7    g    u    j  77
8   8    h    i    k  88
9   9    i    o    l  99
10 10    j    p    m 110

As you see, "d" was merged in rows 4 to 6 and columns 2 to 4. The only NAs are due to the merged cells.

From the proposed code, cl and rw find the column and row of the merged value "d".

The last line finds all NAs and replaces them with "d".

Result:

   X1 X2 X3 X4  X5
1   1  a  q  a  11
2   2  b  w  s  22
3   3  c  e  d  33
4   4  d  d  d  44
5   5  d  d  d  55
6   6  d  d  d  66
7   7  g  u  j  77
8   8  h  i  k  88
9   9  i  o  l  99
10 10  j  p  m 110

Note: The colnames = F should be removed if your Excel data has column names.

R. Schifini
  • 9,085
  • 2
  • 26
  • 32
1

If all you need to do is to fill empty cells with an X, export your data from Excell as a tab-delimited text file, import them into R as character (as.is = TRUE) and replace blanks with X. The last line in the example converts all character columns to factors as per Roland's answer, which are handier for analyses.

tab <- read.table("yourExcelAsText.txt", sep="\t", header=TRUE, as.is=TRUE)
tab[tab==""] <- "X"
tab <- as.data.frame(unclass(tab))
Community
  • 1
  • 1
nya
  • 2,138
  • 15
  • 29
  • The entire sheet is considerably more complex than shown. I'm not trying to fill empty cells with X. "X" represents information in the merged cells that may change from sheet to sheet, and I need to retain that information in the sheet somehow. I don't want to save it, for example, only in the top left corner of unmarked cells because I need to be able to relate it to "b" which, depending on the sheet might be found in A2 or A3 or C1. So if I save it in each of the unmarked cells, I'll be able to pick it up in later steps that clean up this very messy spreadsheet. – Dave Stumped May 29 '16 at 13:22
  • In that case, @lmo provided code that will be more useful to you. – nya May 29 '16 at 13:39
0

You need to solve by case. The code below solve cases like the one you cite. I added one more. Save your file as csvor read as xlxs. This uses for but it could be vectorized for speed easily.

da=read.table("testtemp.csv",sep=";",na.strings=c("", "NA"),stringsAsFactors = F) 
#str(da)

#add more cases
da[5,1]<-da[5,2]<-da[4,2]<-NA
da

> da
    V1   V2   V3 V4
1    q    f    i  l
2    b    x <NA>  m
3    c <NA> <NA>  n
4    d <NA>    j  o
5 <NA> <NA>    k  p

#function to find cases
cencell=function(da){ #i=2;j=2
  fc=data.frame(matrix(NA,nrow(da)-1,3))
  ij=1
  for (i in 1:(ncol(da)-1))
    for (j in 1:(nrow(da)-1)){
      cst=c(da[j,i+1],da[j+1,i+1],da[j+1,i])
      if(all(is.na(cst))){fc[ij,1:2]<-c(j,i)
      fc[ij,3]<-da[j,i]
      ij<-1+ij}
    }
  fc[1:(ij-1),]
}
(ctc=cencell(da))
#replace cases
daf=da  #i=1
for(i in 1:nrow(ctc)){daf[ctc[i,1]+1,ctc[i,2]]<-ctc[i,3]
daf[ctc[i,1]:(ctc[i,1]+1),ctc[i,2]+1]<-ctc[i,3]
}
daf

> daf
  V1 V2 V3 V4
1  q  f  i  l
2  b  x  x  m
3  c  x  x  n
4  d  d  j  o
5  d  d  k  p
Robert
  • 5,038
  • 1
  • 25
  • 43