0

I am trying to read into R Excel spreadsheets that include time-entry rows grouped by employee, which look like this when the groups are collapsed (the commas are used here to delimit columns):

Column A    Column B

Alice

2015-01-01  8
2015-01-02  7.5
2015-01-03  6

Bob

2015-01-02  6
2015-01-03  8

I am able to read the spreadsheets into data frames using the xlsx::read.xlsx2 function, but I have been unable to figure out how to convert the subtotal rows into a column, so that the data frame looks like this:

Alice   2015-01-01  8
Alice   2015-01-02  7.5
Alice   2015-01-03  6
Bob     2015-01-02  6
Bob     2015-01-03  8

I tried looking at reshape and dplyr, but I couldn't figure out if they could help. Can someone please point me in the right direction?

crlsrns
  • 799
  • 1
  • 5
  • 5
  • What do you mean by subtotal? The names? – talat Feb 14 '15 at 09:44
  • Can you format initial table with tabs, to distinguish columns? I believe [`tidyr`](http://cran.r-project.org/web/packages/tidyr/index.html) [`gather`](http://cran.r-project.org/web/packages/tidyr/tidyr.pdf) right for this. – m0nhawk Feb 14 '15 at 10:21
  • Will the number of time entries be the same for each Alice, Bob, etc? – lawyeR Feb 14 '15 at 10:55
  • [how to pivot/unpivot (cast/melt) data frame?](http://stackoverflow.com/q/7980030/1176601) might be useful – Aprillion Feb 14 '15 at 12:16
  • Are the dates in the same column as Alice? Check out `na.locf` from the `zoo` package if they're not. – Hugh Feb 14 '15 at 13:17
  • The number of entries for each user will not necessarily by the same for any given date; and the dates are in fact in the same column as the employee name. – crlsrns Feb 16 '15 at 15:41

1 Answers1

0

May be this helps

library(dplyr)
library(tidyr)
#read the file using `readLines`
lines <- readLines('file.csv')
#remove the empty elements
lines1 <- lines[lines!='']
#create a grouping index based on the occurrence of non-numeric elements 
indx <- cumsum(grepl('^[A-Za-z]', lines1))
#create another index for finding the position of non-numeric element 
indx1 <- grep('^[A-Za-z]', lines1)
#split the lines based on the grouping index
lst <- setNames(split(lines1[-indx1], indx[-indx1]), lines1[indx1])
#use unnest from tidyr and split the `x` column into two
unnest(lst, Name) %>% 
           extract(x, c('Date', 'val'), '(.*),(.*)', convert=TRUE)
#   Name       Date val
#1 Alice 2015-01-01   8
#2 Alice 2015-01-02 7.5
#3 Alice 2015-01-03   6
#4   Bob 2015-01-02   6
#5   Bob 2015-01-03   8

Or you could use base R.

#read the data using `read.csv` or `read.xlsx2`.  Here `,` is the delimiter
d1 <- read.csv('file.csv', header=FALSE, stringsAsFactors=FALSE)
#second column `V2` will have `NAs` for corresponding words in `V1`
indx <- is.na(d1$V2)
#subset the dataset by removing the `NA` rows 
d2 <- d1[!indx,]
#use one of the aggregating functions
#remove the first element for each group  
d2$names <-  unlist(tapply(rep(d1$V1[indx], tabulate(cumsum(indx))), 
             cumsum(indx), FUN=tail,-1), use.names=FALSE)
d2
#         V1  V2 names
#2 2015-01-01 8.0 Alice
#3 2015-01-02 7.5 Alice
#4 2015-01-03 6.0 Alice
#6 2015-01-02 6.0   Bob
#7 2015-01-03 8.0   Bob
akrun
  • 874,273
  • 37
  • 540
  • 662