1

For example, my data set is like this:

  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3
3  CDE  DEF   0.3
4  BCD  ABC   0.5

unique and duplicated may not able to detect the duplication of row 3 and 4.

Since my data set is quite large so is there any efficient way to only keep the unique rows? Like this:

  Var1 Var2 value
1  ABC  BCD   0.5
2  DEF  CDE   0.3

For your convince, you can use:

dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

Also, if possible is there any way to also produce a distribution table for the top 20 variables base on the Var1 (more than 10,000 levels).

P.S. I have tried dat$count <- dat(as.character(dat$Var1))[as.character(dat$Var1)], but it just take too long to run.

Y. Z.
  • 369
  • 2
  • 16
  • So, if I understand correctly, there is no distinction between `Var1` and `Var2`? – divibisan Oct 02 '18 at 18:08
  • @divibisan A very good question. Nope. ABC BCD is the same as BCD ABC. – Y. Z. Oct 02 '18 at 18:12
  • Let us know if the link doesn't help. Re the "also if possible", better to ask one question at a time, I think. – Frank Oct 02 '18 at 18:24
  • @Frank thanks, I will try all of them. The biggest problem is time consuming. – Y. Z. Oct 02 '18 at 18:37
  • Hm, if you want to make an example that can scale up to a large size (eg, as a function of some `n` rows, `ng` unique values), then we could reopen and get more focused answers on the performance/speed issue. Fwiw, I would use `stringsAsFactors=FALSE` and edit the columns so they're sorted ... `library(data.table); setDT(dat); dat[Var1 < Var2, c("Var1", "Var2") := .(Var2, Var1)]; unique(dat, by=c("Var1", "Var2"))` – Frank Oct 02 '18 at 18:42

2 Answers2

2

Another option would be to sort columns Var1 and Var2 rowwise and then apply duplicated.

idx <- !duplicated(t(apply(dat[c("Var1", "Var2")], 1, sort)))
dat[idx, ]
#  Var1 Var2 value
#1  ABC  BCD   0.5
#2  DEF  CDE   0.3
markus
  • 25,843
  • 5
  • 39
  • 58
0

I would start with sorting the value1 and value2 first, and then use unique. When you have only two columns you can just use pman, and pmin:

dat <- data.frame(
   Var1 = c("ABC", "DEF", "CDE", "BCD"),
   Var2 = c("BCD", "CDE", "DEF", "ABC"),
   value = c(0.5, 0.3, 0.3, 0.5))


library(dplyr)
dat %>% mutate(v1 = pmax(as.character(Var1), as.character(Var2)),
               v2 = pmin(as.character(Var1), as.character(Var2))) %>%
  select(v1, v2, value) %>% unique()

#   v1  v2 value
# 1 BCD ABC   0.5
# 2 DEF CDE   0.3

However it might be a bit more complicated when you have more columns VarN.