For example, my data set is like this:
Var1 Var2 value
1 ABC BCD 0.5
2 DEF CDE 0.3
3 CDE DEF 0.3
4 BCD ABC 0.5
unique() and duplicated() do not detect rows 3 and 4 as duplicates, because Var1 and Var2 are swapped relative to rows 2 and 1.
My data set is quite large, so is there an efficient way to keep only the unique rows? Like this:
Var1 Var2 value
1 ABC BCD 0.5
2 DEF CDE 0.3
For your convenience, you can use:
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))
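One possible direction, as a minimal sketch rather than a definitive answer: build an order-independent key for each row with pmin() and pmax(), then call duplicated() on that key. This assumes Var1 and Var2 are character vectors (not factors) and that a swapped pair only counts as a duplicate when value also matches; the names key_lo, key_hi and dedup are made up for illustration.

```r
# Same example data as above; stringsAsFactors = FALSE keeps the columns as character
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5),
                  stringsAsFactors = FALSE)

# Element-wise smaller and larger of the two labels, so that
# (ABC, BCD) and (BCD, ABC) produce the same key
key_lo <- pmin(dat$Var1, dat$Var2)
key_hi <- pmax(dat$Var1, dat$Var2)

# Keep the first occurrence of each (key_lo, key_hi, value) combination
dedup <- dat[!duplicated(data.frame(key_lo, key_hi, value = dat$value)), ]
dedup
```

Because pmin(), pmax() and duplicated() are all vectorized, this avoids a row-by-row loop, although I have not benchmarked it on a large data set.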
Also, if possible, is there a way to produce a distribution table for the top 20 most frequent values of Var1 (which has more than 10,000 levels)?
P.S. I have tried dat$count <- table(as.character(dat$Var1))[as.character(dat$Var1)], but it just takes too long to run.
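For the frequency part, a rough sketch of one approach: sort the table of Var1 and take the first 20 entries, and attach the per-row count through integer factor codes instead of looking counts up by name. The data frame dat2 and its contents are invented for illustration, and whether this is actually faster on more than 10,000 levels is an assumption I have not measured.

```r
# Made-up data with repeated Var1 values (illustration only)
dat2 <- data.frame(Var1 = c("ABC", "DEF", "ABC", "BCD", "ABC", "DEF"),
                   stringsAsFactors = FALSE)

# Frequency table of Var1, most frequent first, limited to 20 entries
top20 <- head(sort(table(dat2$Var1), decreasing = TRUE), 20)
top20

# Per-row count: tabulate() counts each factor level once,
# then the integer codes index into those counts directly
f <- factor(dat2$Var1)
dat2$count <- tabulate(f, nbins = nlevels(f))[as.integer(f)]
dat2
```

The idea is that indexing by integer code avoids matching every row against the table's names, which may be where the original attempt spends its time.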