I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per consecutive run of the grouping variable: as long as successive rows of the group column say m, their text should be pasted together, and so on. A sample data set, before and after, is shown below. I am writing this for a package and have so far avoided relying on other packages except for wordcloud, and I would like to keep it that way.

I suspect rle may be useful with cumsum but haven't been able to figure this one out.

Thank you in advance.

What the data looks like

                                 text group
1       Computer is fun. Not too fun.     m
2               No its not, its dumb.     m
3              How can we be certain?     f
4                    There is no way.     m
5                     I distrust you.     m
6         What are you talking about?     f
7       Shall we move on?  Good then.     f
8 Im hungry.  Lets eat.  You already?     m

What I'd like the data to look like

                                                       text group
1       Computer is fun. Not too fun. No its not, its dumb.     m
2                                    How can we be certain?     f
3                          There is no way. I distrust you.     m
4 What are you talking about? Shall we move on?  Good then.     f
5                       Im hungry.  Lets eat.  You already?     m

The Data

dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.", 
"How can we be certain?", "There is no way.", "I distrust you.", 
"What are you talking about?", "Shall we move on?  Good then.", 
"Im hungry.  Lets eat.  You already?"), group = structure(c(2L, 
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text", 
"group"), row.names = c(NA, 8L), class = "data.frame")

EDIT: I found I can add a unique id column for each run of the group variable with:

x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))

Yielding:

                                 text group new
1       Computer is fun. Not too fun.     m   1
2               No its not, its dumb.     m   1
3              How can we be certain?     f   2
4                    There is no way.     m   3
5                     I distrust you.     m   3
6         What are you talking about?     f   4
7       Shall we move on?  Good then.     f   4
8 Im hungry.  Lets eat.  You already?     m   5
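
The same run ids can also be built along the lines of the cumsum idea mentioned earlier. This is a minimal sketch, assuming a plain character vector g (a hypothetical stand-in for as.character(dat$group)); it marks every position where the value changes and takes a running count of those changes:

```r
# Hypothetical toy vector standing in for as.character(dat$group)
g <- c("m", "m", "f", "m", "m", "f", "f", "m")

# TRUE at the first element and wherever the value differs from the previous one
change <- c(TRUE, g[-1] != g[-length(g)])

# Running count of changes gives one id per consecutive run
run_id <- cumsum(change)

# Equivalent ids from rle, for comparison
r <- rle(g)
rle_id <- rep(seq_along(r$lengths), r$lengths)
```

Both run_id and rle_id number the runs in order of appearance, so either can serve as the grouping vector.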
Tyler Rinker

2 Answers

This uses rle to create an id for each consecutive run of the group variable, then tapply along with paste to combine the sentences within each run.

## Your example data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.", 
"How can we be certain?", "There is no way.", "I distrust you.", 
"What are you talking about?", "Shall we move on?  Good then.", 
"Im hungry.  Lets eat.  You already?"), group = structure(c(2L, 
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text", 
"group"), row.names = c(NA, 8L), class = "data.frame")


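# Run-length encode the group column (lengths and values of consecutive runs)
k <- rle(as.numeric(dat$group))
# Create a grouping vector: one id per run, repeated for every row in that run
id <- rep(seq_along(k$len), k$len)
# Paste the text within each run into a single string
out <- tapply(dat$text, id, paste, collapse = " ")
# Bring it together into a data frame, mapping run values back to factor levels
answer <- data.frame(text = out, group = levels(dat$group)[k$val])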
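
Since this is meant for a package, the steps above could be wrapped in a small function. This is only a sketch under assumptions: the name collapse_runs and its arguments are hypothetical, it relies on base R alone, and it converts the grouping variable to character first so it behaves the same for factors and character vectors.

```r
# Hypothetical helper (name and arguments are illustrative); base R only
collapse_runs <- function(text, group, sep = " ") {
  g <- as.character(group)                     # works for factor or character input
  r <- rle(g)                                  # consecutive runs of the group
  id <- rep(seq_along(r$lengths), r$lengths)   # one id per run
  out <- tapply(as.character(text), id, paste, collapse = sep)
  data.frame(text = as.vector(out), group = r$values,
             stringsAsFactors = FALSE)
}

# Small made-up example
toy <- data.frame(text = c("a", "b", "c", "d"),
                  group = c("m", "m", "f", "m"),
                  stringsAsFactors = FALSE)
res <- collapse_runs(toy$text, toy$group)
```

Here res has one row per run, with the first two rows of toy pasted together because they share the group m.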
Dason
  • I don't believe you need "seq(length(k$len))" since sequence will "seq_along" the k$length vector, giving you the equivalent sequence of numbers: id <- rep(seq(k$length), k$length) – Bryan Goodrich Mar 25 '12 at 05:04
  • @BryanGoodrich Good catch. Originally I was just going to do 1:length(k$len) but as of late I've been moving more toward using seq and seq_along and I guess I ended up with a mishmash of the two approaches. – Dason Mar 25 '12 at 05:28
  • I usually just stick with seq, but for clarity I can see how seq_along makes it explicit that you're numerically traversing a vector of values. I often tend to go that clarity route when I deal with being redundant on boolean vectors using x[which(...some logic here ...)]. The which isn't necessary, but it does give a linguistic clarity to the coding that I prefer. – Bryan Goodrich Mar 26 '12 at 07:16
I got the answer and came back to post it, but Dason beat me to it, and his version is easier to understand than mine.

x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))

Paste <- function(x) paste(x, collapse=" ")
aggregate(text~new, dat, Paste)

EDIT: How I'd do it with aggregate, using what I learned from your response (though tapply is a better solution):

y <- rle(as.character(dat$group))
x <- y[[1]]
dat$new <- as.factor(rep(1:length(x), x))

text <- aggregate(text~new, dat, paste, collapse = " ")[, 2]
data.frame(text, group = y[[2]])
Tyler Rinker
  • Note that you don't need to define "Paste" since aggregate allows you to pass additional parameters to the function being applied. You should be able to remove Paste and use this instead `aggregate(text ~ new, dat, paste, collapse = " ")` – Dason Mar 25 '12 at 04:06