How to merge rows of differing lengths preserving sequence based on ID variable in R

Question

I am working in R from a CSV file that characterizes distinct events (labeled below as "A", "B", "C"). I'm interested in the sequence of behaviors ("x", "y", "z" below) within an event, but an event may be split across more than one row. I would like to line up the full sequence of behaviors for each event in a single row, and I'm having trouble figuring out how to do this in R.

This is what my data look like:

    Behavior 1 |  Behavior 2 | Behavior 3  | Behavior 4  | Behavior 5
A | x          |  x          |             |             |
A | y          |             |             |             |
B | y          |  x          |             |             |
C | y          |  z          |  x          |             |
C | x          |             |             |             |

This is what I would like my data to look like:

     Behavior 1 |  Behavior 2  | Behavior 3  | Behavior 4  | Behavior 5
A |  x          |  x           | y           |             |
B |  y          |  x           |             |             |
C |  y          |  z           | x           | x           |

Thanks in advance for the help!

score 1 · Accepted Answer · answered Jun 30 '17 at 08:27

I suggest you get comfortable with the dplyr and tidyr packages in R, as they are relatively easy to use. Grab the data wrangling cheat sheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf), and here you go:

event <- c("A", "A", "B", "C", "C")
behavior1 <- c("x", "y","y", "y", "x")
behavior2 <- c("x", "","x", "z", "")
behavior3 <- c("", "","", "x", "")
behavior4 <- c("", "","", "", "")
behavior5 <- c("", "","", "", "")
df <- data.frame(event, behavior1, behavior2, behavior3, behavior4, behavior5, stringsAsFactors = FALSE)
df

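library(tidyr)
library(dplyr)
# make table flat, keeping the original row number
# (gather() stacks column by column, so the row number is needed to restore the sequence)
df2 <- df %>% mutate(row = row_number()) %>% gather(behavior, outcome, -event, -row)
df2
# remove empty cells and sort by event, original row, and behavior column
# (this order defines the sequence within each event)
df3 <- df2 %>% filter(outcome != "") %>% arrange(event, row, behavior)
df3
# create row number per event
df4 <- df3 %>% group_by(event) %>% mutate(t = row_number(), behavior_new = paste("Behavior", t)) %>% ungroup()
df4

# drop old behavior, row and t columns
df5 <- df4 %>% select(-behavior, -row, -t)
df5

# spread out behavior again
spread(df5, behavior_new, outcome)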
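In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Here is a minimal sketch of the same reshaping with those functions, using the same example data. The helper columns row and position and the result name wide are my own choices for illustration. Behaviors are ordered by original row first and by behavior column second, which matches the layout requested in the question.

```r
library(dplyr)
library(tidyr)

# Same example data as in the answer
df <- data.frame(
  event     = c("A", "A", "B", "C", "C"),
  behavior1 = c("x", "y", "y", "y", "x"),
  behavior2 = c("x", "",  "x", "z", ""),
  behavior3 = c("",  "",  "",  "x", ""),
  behavior4 = rep("", 5),
  behavior5 = rep("", 5),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(row = row_number()) %>%                 # remember the original row order
  pivot_longer(starts_with("behavior"),
               names_to = "behavior", values_to = "outcome") %>%
  filter(outcome != "") %>%                      # drop empty cells
  arrange(event, row, behavior) %>%              # sequence: row first, then column
  group_by(event) %>%
  mutate(position = paste("Behavior", row_number())) %>%
  ungroup() %>%
  select(event, position, outcome) %>%
  pivot_wider(names_from = position, values_from = outcome)

wide
```

Events with fewer behaviors than the longest one get NA in the trailing columns.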

PS: For your next question, please check the question "How to make a great R reproducible example?" and its first answer, so that you can ask questions in a better style.

score 1 · Answer 2 · answered Jun 30 '17 at 09:12

Alternatively, if you wish to avoid using other packages for whatever reason, you can do this:

beh <- matrix( c("A", "A", "B", "C", "C",
    "x", "y", "y", "y", "x",
    "x", NA, "x", "z", NA,
    NA, NA, NA, "x", NA,
    NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA),
    ncol=6)

ret.list <- list()
events <- unique(beh[,1])

for(evt in events)
{   
    sel <- beh[,1] == evt
    row <- na.omit(as.vector(t(beh[sel, -1])))
    ret.list[[evt]] <- as.vector(row)
}

# if you want a matrix instead:
max.beh <- max(lengths(ret.list))

ret.mat <- matrix(NA, nrow=length(events), ncol=max.beh)
# seq_along() avoids the 1:0 trap when a vector is empty
for(i in seq_along(events))
{
    evt.beh <- ret.list[[events[i]]]
    ret.mat[i, seq_along(evt.beh)] <- evt.beh
}

rownames(ret.mat) <- events
colnames(ret.mat) <- paste("Behavior", 1:max.beh)

This loops over the events and, for each one, collects its behavior labels row by row into a list item, dropping the NAs. If you want a matrix instead, its dimensions are the number of unique events by the length of the longest list item in ret.list. Each list item is then copied into the row for its event, and any trailing cells stay NA.
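The t() call in the loop matters because R stores matrices column by column, so flattening without transposing would interleave the rows. A small sketch of the difference, using the two rows of event C from the example (the variable names here are only for illustration):

```r
# Two rows belonging to one event, padded with NA (values of event C)
m <- matrix(c("y", "z", "x",
              "x", NA,  NA),
            nrow = 2, byrow = TRUE)

by_column <- as.vector(m)      # walks down each column: y, x, z, NA, x, NA
by_row    <- as.vector(t(m))   # transposing first walks each row: y, z, x, x, NA, NA

seq_col <- as.vector(na.omit(by_column))  # "y" "x" "z" "x" (rows interleaved)
seq_row <- as.vector(na.omit(by_row))     # "y" "z" "x" "x" (sequence preserved)
```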
