
Suppose I have a data frame with 100 rows and 100 columns.

For each row, if any two columns have the same value, then that row should be removed.

For example, if columns 1 and 2 are equal, then the row should be removed.

As another example, if columns 10 and 47 are equal, then the row should be removed as well.

Example:

test <- data.frame(x1 = c('a', 'a', 'c', 'd'),
                   x2 = c('a', 'x', 'f', 'h'),
                   x3 = c('s', 'a', 'f', 'g'),
                   x4 = c('a', 'x', 'u', 'a'))

test

  x1 x2 x3 x4
1  a  a  s  a
2  a  x  a  x
3  c  f  f  u
4  d  h  g  a

Only the 4th row should be kept.
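
In other words, the desired output is:

test[4, ]

  x1 x2 x3 x4
4  d  h  g  a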

How can I do this in a quick and concise way, without using for loops?

  • I mean, I'm not convinced you can do this without for loops – Rushabh Mehta Jul 26 '18 at 20:09
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 26 '18 at 20:14
  • https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/unique – BroVic Jul 26 '18 at 20:23

3 Answers


Tested on this 20x20 data frame:

library(tidyverse)    

N <- 20
df <- matrix(as.integer(runif(N^2, 1, 500)), nrow = N, ncol = N) %>% 
  as.tibble()

df

# # A tibble: 20 x 20
# V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11   V12   V13   V14   V15   V16   V17   V18   V19   V20
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#   1   350   278   256   484   486   249    35   308   248    66   493   130   149     2   374    51   370   423   165   388
# 2   368   448   441    62   304   373    38   375   406   463   412    95   174   365   170   113   459   369    62    21
# 3   250   459   416   128   372    67   281   450    48   122   308    56   121   497   498   220    34     4   126   411
# 4   171   306   390    13   395   160   256   258    76   131   471   487   190   492    21   237   380   129     5    30
# 5   402   421     6   401    50   292   470   319   283   178   234    46   176   178   288   499     7   221   123   268
# 6   415   342   132   379   150    35   323   225   246   496   460   478   205   255   460    62    78   207    82   118
# 7   207    52   420   216     9   366   390   382   304    63   427   425   350   112   488   400   328   239   148    40
# 8   392   455   156   386   478     3   359   184   420   138    29   434    31   279    87   233   455    21   181   437
# 9   349   460   498   278   104    93   253   287   124   351    60   333   321   116    19   156   372   168    95   169
# 10   386    73   362   127   313    93   427    81   188   366   418   115   353   412   483   147   295    53    82   188
# 11   272   480   168   306   359    75   436   228   187   279   410   388    62   227   415   374   366   313   187    49
# 12   177   382   233   146   338    76   390   232   336   448   175    79   202   230   317   296   410    90   102   465
# 13   108   433    59   151     8   138   464   458   183   316   481   153   403   193    71   136    27   454    62   439
# 14   421    72   106   442   338   440   476   357    74   108    94   407   453   262   355   356    27   217   243   455
# 15   325   449   151   473   241    11   154    52    77   489   137   279   420   120   165   289    70   128   384    53
# 16   126   189    43   354   233   168    48   285   175   348   404   254   168   126    95    65   493   493   187   228
# 17    26   143   112   107   350   198   353   439   192   158   151    23   326     4   304   162    84   412   499   170
# 18    88   156   222   227   452   233   397   203   478    73   483   241   151    38   176    77   244   396     9   393
# 19   361   486   423   310   153   235   274   204   399   493   422   374   399    10   215   468   322    38   395   390
# 20   417   124    21   220   123   399   354   182   233    24   397   263   182   211   360   419   202   240   363   187

Removing rows with any duplicates

df %>% 
  group_by(id = row_number()) %>%       # tag each row with an id so it can be reassembled later
  gather(col, value, -id) %>%           # reshape to long format: one row per (id, column) pair
  filter(!any(duplicated(value))) %>%   # keep only ids whose values contain no duplicates
  spread(col, value)                    # reshape back to wide format

# # A tibble: 11 x 21
# # Groups:   id [11]
# id    V1   V10   V11   V12   V13   V14   V15   V16   V17   V18   V19    V2   V20    V3    V4    V5    V6    V7    V8    V9
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#   1     1   350    66   493   130   149     2   374    51   370   423   165   278   388   256   484   486   249    35   308   248
# 2     3   250   122   308    56   121   497   498   220    34     4   126   459   411   416   128   372    67   281   450    48
# 3     4   171   131   471   487   190   492    21   237   380   129     5   306    30   390    13   395   160   256   258    76
# 4     7   207    63   427   425   350   112   488   400   328   239   148    52    40   420   216     9   366   390   382   304
# 5     9   349   351    60   333   321   116    19   156   372   168    95   460   169   498   278   104    93   253   287   124
# 6    12   177   448   175    79   202   230   317   296   410    90   102   382   465   233   146   338    76   390   232   336
# 7    13   108   316   481   153   403   193    71   136    27   454    62   433   439    59   151     8   138   464   458   183
# 8    14   421   108    94   407   453   262   355   356    27   217   243    72   455   106   442   338   440   476   357    74
# 9    15   325   489   137   279   420   120   165   289    70   128   384   449    53   151   473   241    11   154    52    77
# 10    17    26   158   151    23   326     4   304   162    84   412   499   143   170   112   107   350   198   353   439   192
# 11    18    88    73   483   241   151    38   176    77   244   396     9   156   393   222   227   452   233   397   203   478
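
(Side note: newer tidyr versions supersede gather() and spread() with pivot_longer() and pivot_wider(). A rough equivalent sketch, assuming tidyr >= 1.0.0:)

df %>% 
  mutate(id = row_number()) %>%                                 # tag rows before reshaping
  pivot_longer(-id, names_to = "col", values_to = "value") %>% 
  group_by(id) %>% 
  filter(!any(duplicated(value))) %>%                           # drop ids with any repeated value
  pivot_wider(names_from = col, values_from = value)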
Jack Brookes

Use apply to look for duplicates in each row. (Note that this internally converts your data to a matrix for the comparison. If you are doing a lot of row-wise operations I would recommend either keeping it as a matrix or converting it to a long format as in Jack Brookes's answer.)

# sample data
set.seed(47)
dd = data.frame(matrix(sample(1:5000, size = 100^2, replace = TRUE), nrow = 100))

# remove rows with duplicate entries
result = dd[apply(dd, MARGIN = 1, FUN = function(x) !any(duplicated(x))), ]
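
If you do many row-wise operations like this, one option (a sketch of the "keep it as a matrix" route mentioned above) is to convert once and work on the matrix directly:

# convert once, so each apply() call doesn't have to coerce the data frame again
m <- as.matrix(dd)
keep <- !apply(m, MARGIN = 1, FUN = function(x) any(duplicated(x)))
result <- dd[keep, ]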
Gregor Thomas
  • Thanks Gregor, your solution also works. I learned a lot from you and Jack. Thanks both. Your solution is more concise, I think. – nietzschetmh Jul 26 '18 at 21:18

You can try a series of filters from dplyr. I cooked up some sample data here. If your variables are named, you can use something like the first example; otherwise the second should work.

library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.1
data <- data_frame(
  A = c(1, 2, 3, 4, 5, 6),
  B = c(1, 3, 5, 7, 9, 11),
  C = c(2, 2, 6, 8, 10, 12)
)

data %>%
  filter(A != B) %>%  # This removed the first row
  filter(A != C) # This removed the second row
#> # A tibble: 4 x 3
#>       A     B     C
#>   <dbl> <dbl> <dbl>
#> 1     3     5     6
#> 2     4     7     8
#> 3     5     9    10
#> 4     6    11    12

data %>% 
  filter(.[1] != .[2]) %>% 
  filter(.[1] != .[3])
#> # A tibble: 4 x 3
#>       A     B     C
#>   <dbl> <dbl> <dbl>
#> 1     3     5     6
#> 2     4     7     8
#> 3     5     9    10
#> 4     6    11    12
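
If every column is supposed to be pairwise distinct, one way to avoid typing all the pairs while staying in dplyr (a sketch, assuming dplyr >= 1.0.0 for rowwise() and c_across()):

data %>%
  rowwise() %>%                                           # evaluate the filter one row at a time
  filter(!any(duplicated(c_across(everything())))) %>%    # keep rows with no repeated values
  ungroup()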
  • I don't think this works well since OP says they have 100 columns. `choose(100, 2)` gives 4950, which is a lot of pairwise comparisons to type. – Gregor Thomas Jul 26 '18 at 20:40
  • Agreed, this would only work if OP knows which columns shouldn't have duplicated values. Yours generalizes to all columns – Felipe Flores Jul 26 '18 at 20:45