5

I have what seems like a very inefficient solution to a peculiar problem. I have text data which, for various reasons, is broken across rows of a dataframe at random intervals. However, certain subsets of are known to belong together based on unique combinations of other variables in the dataframe. See, for example, a MWE demonstrating the structure and my initial solution:

# Data
df <- read.table(text="page passage  person index text
1  123   A   1 hello      
1  123   A   2 my
1  123   A   3 name
1  123   A   4 is
1  123   A   5 guy
1  124   B   1 well
1  124   B   2 hello
1  124   B   3 guy",header=T,stringsAsFactors=F)

master<-data.frame()
for (i in 123:max(df$passage)) {
  print(paste0('passage ',i))
  tempset <- df[df$passage==i,]
  concat<-''
  for (j in 1:nrow(tempset)) {
    print(paste0('index ',j))
    concat<-paste(concat, tempset$text[j])
  }
  tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
  master<-rbind(master, tempdf)
  rm(concat, tempset, tempdf)
}
master
> master
  tempset.page.1. tempset.passage.1. tempset.person.1.                concat
1               1                123                 A  hello my name is guy
2               1                124                 B        well hello guy

In this example as in my real case, "passage" is the unique grouping variable, so it is not entirely necessary to take the other pieces along with it, although I'd like them available in my dataset.

My current estimates are that this procedure I have devise will take several hours for a dataset that is otherwise easily handled by R on my computer. Perhaps there are some efficiencies to be gained either by other functions or packages, or not creating and removing so many objects?

Thanks for any help here!

SOConnell
  • 793
  • 1
  • 8
  • 27

2 Answers2

9

data.table Here's one way:

require(data.table)
DT <- data.table(df)

DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
#    page passage person               concat
# 1:    1     123      A hello my name is guy
# 2:    1     124      B       well hello guy

Putting the extra variables (besides passage) in the by doesn't cost much, I think.


dplyr The analogue is

df %>% 
  group_by(page,passage,person) %>% 
  summarise(concat=paste0(text,collapse=" "))

# Source: local data frame [2 x 4]
# Groups: page, passage, person
# 
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 2    1     124      B       well hello guy

base R One way is:

df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[,which(names(df)%in%c("page","passage","person","concat"))])
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 6    1     124      B       well hello guy
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 3
    nice answer, +1. one thing though: why not `DT[, .(concat=paste0(...` just so the data.table answer has an explicitly labeled column like the others? – arvi1000 May 15 '15 at 21:36
4

Here are two ways:

base R

aggregate(
    text ~ page + passage + person, 
    data=df, 
    FUN=paste, collapse=' '
)

dplyr

library(dplyr)
df %>% 
    group_by_(~page, ~passage, ~person) %>%
    summarize_(text=~paste(text, collapse=' '))
Frank
  • 66,179
  • 8
  • 96
  • 180
Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113
  • `select` is not needed. What're `~` and `_` doing? Similarly, you don't need to subset `df` for `aggregate`. `data=df` seems to work. – Frank May 15 '15 at 19:47
  • 1
    Thanks, you're right. Check out `vignette('nse', package='dplyr')` – Matthew Plourde May 15 '15 at 19:50
  • I chose this answer because it was marginally simpler and used base--though all the others worked just as well. Thanks all! – SOConnell May 15 '15 at 23:20