
I have a data set with 10000 rows and 32 columns. Is it possible to select the rows that share the same values in certain features (columns)?

Here is an example to make my question clearer.

col1   col2   col3   col4   col5
1      2      3      4      5
3      4      3      6      8
2      2      5      4      5
4      2      7      4      5
5      4      8      6      8
2      3      1      0      9
3      4      1      5      2

This data set has 5 columns. Suppose I want to select the rows that have the same values in columns 2, 4 and 5.

As can be seen, the first, third and fourth rows have the same values in col2, col4 and col5, and the second and fifth rows also share their values in those columns. So I would pick these rows, and the new data set would be

col1   col2   col3   col4   col5
1      2      3      4      5
3      4      3      6      8
2      2      5      4      5
4      2      7      4      5
5      4      8      6      8
sherek_66
  • You should search before posting. And you should look at the nominations that the SO interface offers. I'm surprised that there was a duplicate offered for your review, since this seems to be a duplicate of others I have seen. If you find a question that is similar then you should link to it and explain why you are having difficulty applying it. – IRTFM Jun 29 '19 at 14:45
  • @42 honestly, I couldn't find any. could you plz refer me to that page? – sherek_66 Jun 29 '19 at 14:46
  • and your negative vote is not fair, when I didn't find similar question – sherek_66 Jun 29 '19 at 14:47
  • @42 I even didn't know it is called duplicated data – sherek_66 Jun 29 '19 at 14:49
  • If you have done a search, you can avoid negative votes by documenting the search strategy. Otherwise people may assume as I did that you did not do any searching. – IRTFM Jun 29 '19 at 14:51
  • that link is very different than my question! – sherek_66 Jun 29 '19 at 14:53
  • As I said earlier, if you find a similar but not quite exactly applicable question (and it might not be the first one you look at) then you should describe what attempts you have made to apply it and how it fails. You should not look at only one potential question but at least two or three based on their titles. – IRTFM Jun 29 '19 at 15:00

1 Answer


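I think the link provided by @42 gives you an idea of how to solve this problem. Select the columns of interest and apply duplicated in both directions (once from the top, and once from the bottom with fromLast = TRUE), so that every row of a repeated combination is flagged, including its first occurrence.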

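# Positions of the columns whose values must match
cols <- c(2, 4, 5)
# Keep every row that is flagged as a duplicate when scanning from either end
df[duplicated(df[cols]) | duplicated(df[cols], fromLast = TRUE), ]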

#  col1 col2 col3 col4 col5
#1    1    2    3    4    5
#2    3    4    3    6    8
#3    2    2    5    4    5
#4    4    2    7    4    5
#5    5    4    8    6    8
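
To make the snippet above reproducible, here is a minimal sketch that rebuilds the example data from the question and splits the condition into its two halves. The data frame name df follows the answer; dup_forward, dup_backward and result are names introduced here purely for illustration.

```r
# Rebuild the example data from the question (values copied from the table).
df <- data.frame(
  col1 = c(1, 3, 2, 4, 5, 2, 3),
  col2 = c(2, 4, 2, 2, 4, 3, 4),
  col3 = c(3, 3, 5, 7, 8, 1, 1),
  col4 = c(4, 6, 4, 4, 6, 0, 5),
  col5 = c(5, 8, 5, 5, 8, 9, 2)
)
cols <- c(2, 4, 5)

# Scanning top-down flags every repeat except the first occurrence.
dup_forward <- duplicated(df[cols])
# Scanning bottom-up flags every repeat except the last occurrence.
dup_backward <- duplicated(df[cols], fromLast = TRUE)

# A row belongs to a repeated combination if either scan flags it.
result <- df[dup_forward | dup_backward, ]
print(result)
```

Using only one of the two scans would drop one member of each matching group (the first or the last), which is why the answer combines them with |.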

Another way to solve this, using dplyr, is to group_by the respective columns and keep only the groups that contain more than one row.

library(dplyr)
df %>% group_by_at(cols) %>% filter(n() > 1)
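
As a side note, the scoped verbs such as group_by_at are marked as superseded in dplyr 1.0.0 and later in favour of across(). The following is a hedged sketch of the same idea in that style, assuming dplyr >= 1.0.0 is installed; it selects the columns by name rather than by position and adds ungroup() so the result is no longer grouped. The example data is rebuilt from the question, and result is an illustrative name.

```r
library(dplyr)

# Example data from the question.
df <- data.frame(
  col1 = c(1, 3, 2, 4, 5, 2, 3),
  col2 = c(2, 4, 2, 2, 4, 3, 4),
  col3 = c(3, 3, 5, 7, 8, 1, 1),
  col4 = c(4, 6, 4, 4, 6, 0, 5),
  col5 = c(5, 8, 5, 5, 8, 9, 2)
)
# Column names are used here; numeric positions would also work with all_of().
cols <- c("col2", "col4", "col5")

result <- df %>%
  group_by(across(all_of(cols))) %>%  # one group per value combination
  filter(n() > 1) %>%                 # keep groups with at least two rows
  ungroup()                           # drop the grouping from the result

print(result)
```

Both versions keep the rows in their original order; the main practical difference is that the dplyr variants return a tibble rather than a base data frame.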
Ronak Shah