Refactor dataframe rows according to a variable in R more efficiently

Question

I have a data frame of malware observations. One of its variables is type, which can hold a combination of types (such as adware++trojan). I need to duplicate each such observation once per type it contains, giving each copy one of the separated types. For example, for one observation:

     apksha                  time            type    market
8AB46C4A8AC   2013-09-23 16:04:24   adware++virus   1mobile

I want it to be like:

     apksha                  time            type    market
8AB46C4A8AC   2013-09-23 16:04:24          adware   1mobile
8AB46C4A8AC   2013-09-23 16:04:24           virus   1mobile

I'm currently using a nested for loop for this task:

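newData <- data.frame()
# Assumption: `types` was not defined in the original post; it is taken here
# to be the distinct values of the type column
types <- unique(rawData$type)
# Types that contain "+", i.e. combined types such as "adware++virus"
combinedTypes <- grep("\\+", types, value=TRUE, perl=TRUE)
ctData <- rawData[rawData$type %in% combinedTypes, ]
# seq_len() avoids iterating over c(1, 0) when ctData has no rows
for(i in seq_len(nrow(ctData))){
    type <- ctData[i, ]$type
    newTypes <- unlist(strsplit(type, "\\+\\+"))
    for(t in newTypes){
        # One copy of the row per separated type
        nr <- ctData[i, ]
        nr$type <- t
        newData <- rbind(newData, nr)
    }
}
# Replace the combined-type rows with the expanded rows
rawData <- rawData[!(rawData$type %in% combinedTypes), ]
rawData <- rbind(rawData, newData)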

The problem is that this nested loop is very slow in R. Is there a better solution for this task?

I found a quick and dirty way:

splitedtype <- strsplit(rawData$type, "\\+\\+")
dataNew <- rawData[rep(seq_len(nrow(rawData)), lengths(splitedtype)), ]
dataNew$type <- unlist(splitedtype)
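
The row-repetition approach above can be tried on a small made-up data frame. This is only a minimal sketch: the second and third rows of `rawData` below (including the three-part `adware++trojan++virus` type) are invented for illustration, and `fixed = TRUE` is swapped in for the regular expression so that `++` is matched literally.

```r
# Toy data: only the first row comes from the question, the rest is hypothetical
rawData <- data.frame(
  apksha = c("8AB46C4A8AC", "AAAAAAAAAAA", "BBBBBBBBBBB"),
  time   = c("2013-09-23 16:04:24", "2013-09-24 10:00:00", "2013-09-25 12:30:00"),
  type   = c("adware++virus", "trojan", "adware++trojan++virus"),
  market = c("1mobile", "1mobile", "1mobile"),
  stringsAsFactors = FALSE
)

# Split each type string on the literal "++"; the result is a list with
# one character vector per row
splitTypes <- strsplit(rawData$type, "++", fixed = TRUE)

# Repeat each row index once per type it contains, then overwrite the type column
dataNew <- rawData[rep(seq_len(nrow(rawData)), lengths(splitTypes)), ]
dataNew$type <- unlist(splitTypes)
rownames(dataNew) <- NULL  # reset the row names left over from repeated indexing

print(dataNew)
```

If `type` is stored as a factor, wrap it in `as.character()` first, since `strsplit()` only accepts character vectors.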

Would you mind putting a `dput()` of the first 10 or so rows of your data? — Gin_Salmon, Mar 03 '17 at 11:23
Thanks @Jaap, the question you mentioned is similar, but it cannot be solved the same way since my data has more than 2 variables (columns). I updated my post with a more general solution, borrowing the idea from that question. — Jun Gao, Mar 03 '17 at 15:13
Have you read [my answer](http://stackoverflow.com/a/31514711/2204410) there? `tidyr::separate_rows(rawData, type, sep = '\\+\\+')` works perfectly — Jaap, Mar 03 '17 at 16:26
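
The `tidyr::separate_rows()` call from the last comment can be sketched on the question's example row. This assumes the tidyr package is installed; the one-row `rawData` is rebuilt here from the example in the question, and `longData` is just an illustrative name.

```r
library(tidyr)

# The single example observation from the question
rawData <- data.frame(
  apksha = "8AB46C4A8AC",
  time   = "2013-09-23 16:04:24",
  type   = "adware++virus",
  market = "1mobile",
  stringsAsFactors = FALSE
)

# sep is a regular expression, so each "+" has to be escaped
longData <- separate_rows(rawData, type, sep = "\\+\\+")
print(longData)
```

All other columns are carried along unchanged, so this is not limited to two-column data.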

score 0 · Answer 1 · answered Mar 03 '17 at 11:30

First, separate the type column into two columns:

library(readr)    # read_table()
library(tidyr)    # separate() and the %>% pipe
library(reshape2) # melt()

dat <- read_table("apksha                  time            type          market  
8AB46C4A8AC   2013-09-23 16:04:24   adware++virus        1mobile")

dat <- 
dat %>% 
  separate(type, into = c("type1", "type2"), sep = "\\+\\+")

# A tibble: 1 × 5
       apksha                time  type1 type2  market
*       <chr>              <dttm>  <chr> <chr>   <chr>
1 8AB46C4A8AC 2013-09-23 16:04:24 adware virus 1mobile

Next, use reshape2::melt to reshape the data into long format. The columns listed in id.vars are left untouched:

    melt(dat, id.vars=c("apksha", "time", "market"))

       apksha                time  market variable  value
1 8AB46C4A8AC 2013-09-23 16:04:24 1mobile    type1 adware
2 8AB46C4A8AC 2013-09-23 16:04:24 1mobile    type2  virus

Hope that helps and let me know if there are any other questions!

score 0 · Answer 2 · answered Mar 03 '17 at 11:33

We can use data.table for this:

library(data.table)

data1 <- data.table(apksha = c("8AB46C4A8AC"), time = c("2013-09-23 16:04:24"), type = c("adware++virus"), market = c("1mobile"))

data1[, paste0("type", 1:2) := tstrsplit(type, "\\+\\+")]
melt(data1[,.(apksha, time, market,type1,type2)], id.vars = c("apksha", "time", "market"))

Output:

     apksha                time  market variable  value
1: 8AB46C4A8AC 2013-09-23 16:04:24 1mobile    type1 adware
2: 8AB46C4A8AC 2013-09-23 16:04:24 1mobile    type2  virus

All that's left is to rename the columns, and that should do it!
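
The two-column split above is written for rows that hold two types. As a sketch of one way to handle any number of types in data.table, the type string can be split within groups defined by the other columns. The second row below is invented for illustration, and grouping by `apksha`, `time`, and `market` assumes those columns together identify an observation.

```r
library(data.table)

# First row is the question's example; the second row is hypothetical
data1 <- data.table(
  apksha = c("8AB46C4A8AC", "AAAAAAAAAAA"),
  time   = c("2013-09-23 16:04:24", "2013-09-24 10:00:00"),
  type   = c("adware++virus", "adware++trojan++virus"),
  market = c("1mobile", "1mobile")
)

# For each group, split the type string on the literal "++" and return
# one row per piece; the grouping columns are repeated automatically
longData <- data1[, .(type = unlist(strsplit(type, "++", fixed = TRUE))),
                  by = .(apksha, time, market)]
print(longData)
```

The result already has a single `type` column, so no renaming step is needed.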
