Remove Duplicated String in a Row

Question

I have a data frame data1 with one variable whose rows each contain several comma-separated entries:

data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"))

Now I want to remove duplicated entries in each row. The result should look like this:

data1.final <- data.frame(v1 = c("test, bird", "bird", "car"))

I tried this:

data1$ID <- 1:nrow(data1)
data1$v1 <- as.character(data1$v1)

data1 <- split(data1, data1$ID)
reduce.words <- function(x) {
  d <- unlist(strsplit(x$v1, split=" "))
  d <- paste(d[-which(duplicated(d))], collapse = ' ')
  x$v1 <- d 
  return(x)
}
data1 <- lapply(data1, reduce.words)
data1 <- as.data.frame(do.call(rbind, data1))

However, this yields empty rows, except for the first one. Does anyone have an idea how to solve this problem?

score 5 · Accepted Answer · answered Nov 27 '14 at 15:27

You seem to have a rather complicated workflow. What about just creating a simple function that works on a single row:

reduce_row = function(i) {
  split = strsplit(i, split=", ")[[1]]
  paste(unique(split), collapse = ", ") 
}

and then using apply

data1$v2 = apply(data1, 1, reduce_row)

to get

R> data1
                v1         v2
1 test, test, bird test, bird
2       bird, bird       bird
3              car        car
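
As a side note, here is a minimal sketch of why the original attempt probably blanked every row but the first, along with a variant that works on the column directly. The apply call above passes whole rows, so it relies on data1 containing only v1. The helper name dedupe_entries and its sep argument are made up for illustration and are not part of the answer above.

```r
# Example data from the question
data1 <- data.frame(v1 = c("test, test, bird", "bird, bird", "car"),
                    stringsAsFactors = FALSE)

# Pitfall in the question's code: when nothing is duplicated, which()
# returns integer(0), and negative indexing by integer(0) selects nothing.
d <- c("bird,", "bird")        # row 2 after splitting on " " instead of ", "
d[-which(duplicated(d))]       # character(0), so paste() yields ""
d[!duplicated(d)]              # logical indexing keeps both elements

# Hypothetical helper: de-duplicate the entries of each string in a vector.
dedupe_entries <- function(x, sep = ", ") {
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

# Operates on the column only, so extra columns such as ID do not interfere.
data1$v2 <- dedupe_entries(data1$v1)
```

Splitting on ", " rather than " " also matters here: with a plain space, "bird," and "bird" count as different strings and the duplicate in the second row is never detected.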

score 3 · Answer 2 · akrun · edited Nov 27 '14 at 16:54

Another option is cSplit from the splitstackshape package:

library(splitstackshape)
cSplit(cbind(data1, indx = 1:nrow(data1)), 'v1', ', ', 'long')[
  , toString(v1[!duplicated(v1)]), by = indx][, indx := NULL][]
#          V1
#1: test, bird
#2:       bird
#3:        car

Or, as @Ananda Mahto mentioned in the comments:

library(data.table)  # for as.data.table(), in case splitstackshape does not attach it
unique(cSplit(as.data.table(data1, keep.rownames = TRUE),
              "v1", ",", "long"))[, toString(v1), by = rn]
#   rn         V1
#1:  1 test, bird
#2:  2       bird
#3:  3        car

edited Nov 27 '14 at 16:54

answered Nov 27 '14 at 15:37

akrun


(+1). My personal preference (not sure if it's any more efficient or not) is to use `keep.rownames` and `unique`, so something like `unique(cSplit(as.data.table(data1, keep.rownames = TRUE), "v1", ",", "long"))[, toString(v1), by = rn]` instead. – A5C1D2H2I1M1N2O1R2T1 Nov 27 '14 at 16:49
@AnandaMahto Thanks, I hadn't thought about `keep.rownames`. It is a good option. – akrun Nov 27 '14 at 16:53
