For context: this is a follow-up to a question I recently posted: R - Identify and remove duplicate rows based on two columns

I need to do something very similar to what I described in that post, but let me explain here in full.

I have some data that looks like this (in case it's relevant, there are MANY other columns with other data):

Course_ID   Text_ID
33          17
33          17
58          17
5           22
8           22
42          25
42          25
17          26
17          26
35          39
51          39

I need to identify any rows where both Course_ID AND Text_ID match those of another row. For example, in the data above, the first two rows are identical in both columns (33 and 17). Wherever this occurs, I need to keep one copy of the row and remove the extra duplicates.

The final data should look like this:

Course_ID   Text_ID
33          17
58          17
5           22
8           22
42          25
17          26
35          39
51          39

The solution offered in my previous post removed every instance of a duplicated row, so no copy of those rows was left.
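
To make the difference concrete, here is a minimal sketch in base R that rebuilds the two columns shown above and contrasts the two behaviours. The `fromLast` variant is only an assumption about how "remove all instances" might have been done in the earlier post; the variable names are made up for illustration.

```r
# Sample data from the question (the real data has many more columns).
df <- data.frame(
  Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51),
  Text_ID   = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39)
)

key <- df[c("Course_ID", "Text_ID")]

# TRUE for the second and later copies of a (Course_ID, Text_ID) pair.
later_copy <- duplicated(key)

# TRUE for every row whose pair occurs more than once, first copy included.
any_copy <- duplicated(key) | duplicated(key, fromLast = TRUE)

drops_all   <- df[!any_copy, ]    # unwanted: 33/17, 42/25 and 17/26 vanish entirely
keeps_first <- df[!later_copy, ]  # wanted: one row per pair

nrow(drops_all)    # 5
nrow(keeps_first)  # 8
```

The second result has the eight rows listed under "The final data should look like this".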

Thanks in advance.

Japes

2 Answers

subset(df, !duplicated(df[c('Course_ID', 'Text_ID')]))
   Course_ID Text_ID
1         33      17
3         58      17
4          5      22
5          8      22
6         42      25
8         17      26
10        35      39
11        51      39

or even

df[!duplicated(df[c('Course_ID', 'Text_ID')]), ]

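If the data frame had only the two columns shown, unique(df) would be enough. With other columns present, unique(df) only drops rows that are identical in every column, so use one of the duplicated() forms above.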
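
Since the question mentions many other columns, a small sketch may help show why unique(df) and the duplicated() approach can differ. The `Score` column and its values are hypothetical, added only to stand in for the extra data.

```r
# The question's data plus a hypothetical extra column, Score, whose values
# differ between some rows that match on Course_ID and Text_ID.
df <- data.frame(
  Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51),
  Text_ID   = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39),
  Score     = c(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10)
)

# unique() compares whole rows, so only the exact repeat (42, 25, 6) is dropped.
by_all_columns <- unique(df)

# duplicated() on the two key columns ignores Score and keeps the first row
# of each (Course_ID, Text_ID) pair, with all other columns intact.
by_key <- df[!duplicated(df[c("Course_ID", "Text_ID")]), ]

nrow(by_all_columns)  # 10
nrow(by_key)          # 8
```

Note that the first row of each pair is the one kept; if a different copy should survive, the data would need to be sorted accordingly first.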

Onyambu

Does this work:

library(dplyr)
df %>% group_by(Course_ID, Text_ID) %>% distinct()
# A tibble: 8 x 2
# Groups:   Course_ID, Text_ID [8]
  Course_ID Text_ID
      <dbl>   <dbl>
1        33      17
2        58      17
3         5      22
4         8      22
5        42      25
6        17      26
7        35      39
8        51      39
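
One caveat: with no arguments, distinct() compares every column, so on the full data (with its other columns) the pipeline above may leave some duplicates in place. A possible alternative, sketched here under the assumption that keeping the first row of each pair is acceptable, is to name the two columns and set `.keep_all = TRUE`. The `Score` column is hypothetical and only stands in for the extra data.

```r
library(dplyr)

# The question's data plus a hypothetical extra column, Score.
df <- data.frame(
  Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51),
  Text_ID   = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39),
  Score     = c(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10)
)

# Deduplicate on the two key columns only; .keep_all = TRUE retains the
# remaining columns from the first row of each (Course_ID, Text_ID) pair.
deduped <- df %>% distinct(Course_ID, Text_ID, .keep_all = TRUE)

nrow(deduped)  # 8
```

No group_by() is needed here, and the result is an ordinary ungrouped data frame.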
Karthik S