Performance Issue in R grouping data

Question

What I am trying to do:

1. Read file contents into a matrix (with two features/columns: ID and Text)
2. Collapse rows that have the same ID, or, if not possible, create a new matrix with the collapsed data
3. Output a .txt file in the wd that has the ID as a name and the Text as content

Here is what I did:

#set working directory and get file_list
myvar <- matrix(0,nrow=0,ncol=2)
colnames(myvar) <- c("PID","Seq")

for(file in file_list)
{
    print(file)
    Mymatrix <- as.matrix(read.table(file))

    for(i in 1:length(Mymatrix[,1]))
    {
        if(Mymatrix[i,1] %in% myvar[,1])
        {
            myvar[which(myvar[,1] == Mymatrix[i,1]) ,2] <- paste(myvar[which(myvar[,1] == Mymatrix[i,1]),2],Mymatrix[i,2])
        }else{
            myvar <- rbind(myvar,c(Mymatrix[i,1],Mymatrix[i,2]))
        }
    }
}

Performance is an issue; see the profvis output here: profvis results

Here is a reproducible example:

#Input:
a <- matrix(0,ncol=2, nrow=0)
colnames(a) <- c("id","text")

#possible data in the matrix after reading one file
a <- rbind(a,c(1,"4 5 7 7 8 1"))
a <- rbind(a,c(1,"5 5 1 3 7 5 1"))
a <- rbind(a,c(7,"5 5 1 3 7 5 1"))
a <- rbind(a,c(5,"1 3 2 25 5 1 3 7 5 1"))

#expected output after processing

   > a
     id  text                       
[1,] "1" "4 5 7 7 8 1 5 5 1 3 7 5 1"
[2,] "7" "5 5 1 3 7 5 1"            
[3,] "5" "1 3 2 25 5 1 3 7 5 1"

Note: The order of the text after collapsing rows was kept: (4 5 7 7 8 1 followed by 5 5 1 3 7 5 1 for ID=1)

As mentioned before, the biggest issue is performance: the way I'm currently doing it takes far too much time. Is there a solution using something like aggregate or apply?

See [this general QA](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega); it seems you need to apply `paste(text, collapse = " ")` with `id` being the group. — alexis_laz, Jun 21 '16 at 13:09

lmo · Accepted Answer · 2016-06-21T13:29:01.790


Here is a method that uses aggregate with paste and collapse=" ", as suggested by @alexis-laz:

# convert matrix to data.frame and aggregate by id
dfAgg <- aggregate(text ~ id, data=data.frame(a), FUN=paste, collapse=" ")

# coerce dfAgg to matrix
as.matrix(dfAgg)
     id  text                       
[1,] "1" "4 5 7 7 8 1 5 5 1 3 7 5 1"
[2,] "5" "1 3 2 25 5 1 3 7 5 1"     
[3,] "7" "5 5 1 3 7 5 1"

Note that the explicit call to data.frame is not necessary in this example, as R will perform the coercion automatically. Still, it seems like good programming practice to make coercions explicit.
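
One detail worth noting: aggregate returns the groups sorted by id (1, 5, 7), while the expected output in the question keeps the ids in order of first appearance (1, 7, 5). If that order matters, a minimal sketch along the following lines may help. It swaps aggregate for tapply with a factor whose levels follow first appearance, and it also covers the question's last step of writing one .txt file per id. The names `id_f`, `collapsed`, `res` and `out_dir` are made up for illustration, and the temporary output directory is an assumption standing in for the working directory.

```r
# Same sample matrix as in the question
a <- matrix(c("1", "1", "7", "5",
              "4 5 7 7 8 1", "5 5 1 3 7 5 1",
              "5 5 1 3 7 5 1", "1 3 2 25 5 1 3 7 5 1"),
            ncol = 2, dimnames = list(NULL, c("id", "text")))

# Factor levels follow first appearance, so groups keep the input order
id_f <- factor(a[, "id"], levels = unique(a[, "id"]))

# Paste the texts of each group together, in row order
collapsed <- tapply(a[, "text"], id_f, paste, collapse = " ")

# Rebuild a two-column character matrix
res <- cbind(id = levels(id_f), text = as.vector(collapsed))
res

# Write one file per id (hypothetical output directory under tempdir())
out_dir <- file.path(tempdir(), "collapsed")
dir.create(out_dir, showWarnings = FALSE)
for (k in seq_len(nrow(res))) {
  writeLines(res[k, "text"], file.path(out_dir, paste0(res[k, "id"], ".txt")))
}
```

Replacing `out_dir` with `getwd()` would write the files to the working directory instead.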

edited Jun 21 '16 at 13:29

answered Jun 21 '16 at 13:15

lmo


I'm guessing aggregate doesn't accept a matrix as input, which is why you used data=data.frame(a)? I will try that and see whether it improves performance. – Imlerith Jun 21 '16 at 13:24
I've never used `aggregate` on a matrix, but just tried it and it worked. – lmo Jun 21 '16 at 13:26
One issue is that you are growing an object in a loop. This tends to have a large impact on performance, as R has to repeatedly copy the object to a new location in each iteration in order to add that additional row (or column or element). When using loops, it's better to preallocate the object with zeros or empty strings and then fill it up. – lmo Jun 21 '16 at 13:43
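
To illustrate the last comment, here is a small sketch that contrasts growing a matrix with rbind against filling a preallocated one. The row count `n` and the placeholder strings are made up for the example, and no timings are claimed here; wrapping each loop in system.time() on your own data would show the difference.

```r
n <- 1000  # hypothetical number of rows

# Growing: each rbind() copies the whole matrix into a new, larger one
grown <- matrix(character(0), nrow = 0, ncol = 2)
for (i in seq_len(n)) {
  grown <- rbind(grown, c(i, paste("text", i)))
}

# Preallocating: reserve all rows up front, then fill them in place
filled <- matrix("", nrow = n, ncol = 2)
for (i in seq_len(n)) {
  filled[i, ] <- c(i, paste("text", i))
}

# Both approaches build the same matrix
identical(grown, filled)
```

When the final number of rows is not known in advance, as with the files in the question, another common option is to collect the per-file matrices in a list and combine them once with do.call(rbind, ...) before aggregating.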
