0

Good evening guys,I want to combine the same id row into one row,with additional column,and here is my part of data.

sample=structure(list(crsp_fundno = c(18021, 18021, 18021, 18021, 22436, 
                                      22436, 22436, 22436, 22436, 22436, 49805, 49805, 49805, 55603, 
                                      55603, 93362), seq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 
                                                             1L, 2L, 3L, 1L, 2L, 1L), begdt = structure(c(13513, 14298, 15027, 
                                                                                                          16149, 12417, 13969, 14910, 14918, 15042, 15644, 14782, 14910, 
                                                                                                          15544, 15505, 15531, 17571), class = "Date"), enddt = structure(c(14297, 
                                                                                                                                                                            15026, 16148, 17621, 13968, 14909, 14917, 15041, 15643, 17621, 
                                                                                                                                                                            14909, 15543, 17621, 15530, 17621, 17621), class = "Date"), crsp_obj_cd = c("EDYG", 
                                                                                                                                                                                                                                                        "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", 
                                                                                                                                                                                                                                                        "EDYG", "EF", "EF", "EF", "EDYB", "EDYB", "M"), lipper_class = c("MLGE", 
                                                                                                                                                                                                                                                                                                                         "MCCE", "MCVE", "MLCE", "MLVE", "MLVE", "MLCE", "MLVE", "MLCE", 
                                                                                                                                                                                                                                                                                                                         "MLVE", "IMLC", "IMLG", "IMLC", "MTAM", "MTAC", "MATJ"), lipper_obj_cd = c("G", 
                                                                                                                                                                                                                                                                                                                                                                                                    "G", "G", "G", "G", "G", "G", "G", "G", "G", "IF", "IF", "IF", 
                                                                                                                                                                                                                                                                                                                                                                                                    "GI", "GI", "I"), lipper_asset_cd = c("EQ", "EQ", "EQ", "EQ", 
                                                                                                                                                                                                                                                                                                                                                                                                                                          "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", 
                                                                                                                                                                                                                                                                                                                                                                                                                                          "EQ")), class = "data.frame", row.names = c(NA, -16L))

I tried to merge row which has the same ID in one row, and here is my code.

temp=list()
dn=unique(sample$crsp_fundno)
for(i in 1:length(dn) ){
  part=sample[which(sample$crsp_fundno %in% dn[i]),]
  part=reshape(part,idvar='crsp_fundno',timevar='seq',direction='wide')
  temp[[i]]=part
}

library(plyr)
sum=rbind.fill(temp[[1]],temp[[2]])

for (i in 3 :length(dn)){sum=rbind.fill(sum,temp[[i]])}

Code works ,but too low in my whole data(94000 obs almost take 2 hours).

I think I should not heavily rely on for loop in large data set.

May anyone know how can I improve the code or my logic ?

Thanks for your help.

  • 1
    Isn't this enough?: `sum <- reshape(sample,direction = "wide",idvar = "crsp_fundno",timevar = "seq")`. It gives the same result as you get for your `sum` at the end of your code! – Vitali Avagyan Sep 22 '19 at 13:49
  • @VitaliAvagyan oh.....interesting.So I use a too complicated method to do a simple thing,Thanks for your help – Chon Kit Hui Sep 22 '19 at 14:41
  • You're welcome:). I will add this as an answer and ask you to accept and upvote. – Vitali Avagyan Sep 22 '19 at 14:48
  • Possible duplicate https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format – Ronak Shah Sep 23 '19 at 01:03

2 Answers2

1

So, using reshape is correct, however, the implementation is not ideal. The function is already optimised to convert between long and wide formats without the need for any for loop.

You only need to call it once and save time:

library(reshape2)
sum <- reshape(sample,direction = "wide",idvar = "crsp_fundno",timevar = "seq")

As you correctly had a hunch, reshape is able to smoothly change between formats.

In your case you have:

  1. crsp_fundno is a variable in long format that identifies multiple records from the same group
  2. seq is the variable in long format that differentiates multiple records from the same group
Vitali Avagyan
  • 1,193
  • 1
  • 7
  • 17
1

I strongly recommend data.table when it comes to speed and memory efficiency.

setDT(sample) # in place, no assignment needed
sum3 <- dcast(sample,
              crsp_fundno ~ seq,
              value.var = names(sample)[3:8])

Here is a comparison of the OP's for loop, the reshape way suggested in one of the answers and the data.table way suggested in this post:

Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval cld
   for loop 23.735154 24.190626 25.948536 24.722330 26.176343 42.764253   100   c
    reshape  6.448800  6.742147  7.196820  6.850390  7.379401  9.932432   100  b 
 data.table  1.928812  2.143367  2.362979  2.255964  2.447935  5.847116   100 a  
haci
  • 241
  • 1
  • 8