How do I remove repeated and identical rows from a dataframe in R?

Question

I have a large dataset with too many rows for my purpose, and I'm trying to figure out how to make it simple and usable. I have removed all of the columns that I do not need (for now) and removed all of the NA rows.

The data looks like this:

   SampleID   Score   Habitat  
   001-1         0     MCSHRU  
   001-2         1     MCSHRU  
   001-2         1     MCSHRU  
   001-2         1     MCSHRU  
   001-3         0       MCRU  
   001-4         3     MCSHRU  
   001-4         3     MCSHRU

Sample 001-2 has three entries, and they're all the same. This is because the original dataset had a row for each species found in each sample. I'm not interested in the species data and I just want to compare scores for each habitat.

I'd like to have just one row for each SampleID. I could take the mean or minimum for the Score data, but I'm not sure what to do with Habitat data because it is categorical.

How can I clean out the repeated data rows so that there is only one row of data for each SampleID?

It should look like this in the end:

   SampleID   Score   Habitat  
   001-1         0     MCSHRU  
   001-2         1     MCSHRU  
   001-3         0       MCRU  
   001-4         3     MCSHRU

`dplyr::distinct(df)`, or if you have more columns, `dplyr::distinct(df, SampleID, Score, Habitat, .keep_all = TRUE)` – Ronak Shah, Aug 04 '21 at 12:25

score 2 · Accepted Answer · answered Aug 04 '21 at 12:26

Use `unique(df)` from base R or `distinct(df)` from dplyr. Both drop rows that are identical across every column.
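As a minimal sketch of the base R route, the example below rebuilds the sample data from the question (the name `df` and the column types are assumptions, since the original import code is not shown) and removes fully identical rows. `SampleID` is kept as character so the leading zeros survive.

```r
# Sample data copied from the question; the object name `df` is assumed.
df <- data.frame(
  SampleID = c("001-1", "001-2", "001-2", "001-2", "001-3", "001-4", "001-4"),
  Score    = c(0, 1, 1, 1, 0, 3, 3),
  Habitat  = c("MCSHRU", "MCSHRU", "MCSHRU", "MCSHRU", "MCRU", "MCSHRU", "MCSHRU"),
  stringsAsFactors = FALSE
)

# unique() keeps the first occurrence of each fully identical row.
deduped <- unique(df)

# Equivalent form: duplicated() flags every repeat of an earlier row.
deduped_alt <- df[!duplicated(df), ]

# Reset row names so they run 1..n after the repeats are dropped.
rownames(deduped) <- NULL
rownames(deduped_alt) <- NULL

print(deduped)
```

Because the repeated rows here match in every column, no averaging of `Score` or special handling of `Habitat` is needed.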

dy_by

Thanks for the lightning fast response! I used `df2 <- df %>% distinct(SampleID, .keep_all = TRUE)` and it worked perfectly. – ayesha Aug 04 '21 at 12:43
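One thing worth noting about that call: `distinct(SampleID, .keep_all = TRUE)` keeps the first row it meets for each `SampleID`, whether or not the other columns agree. The sketch below, which assumes dplyr is installed and reuses the sample data from the question under the assumed name `df`, runs the same call and then checks that no `SampleID` has conflicting rows.

```r
library(dplyr)

# Sample data copied from the question; the object name `df` is assumed.
df <- data.frame(
  SampleID = c("001-1", "001-2", "001-2", "001-2", "001-3", "001-4", "001-4"),
  Score    = c(0, 1, 1, 1, 0, 3, 3),
  Habitat  = c("MCSHRU", "MCSHRU", "MCSHRU", "MCSHRU", "MCRU", "MCSHRU", "MCSHRU"),
  stringsAsFactors = FALSE
)

# Keep the first row for each SampleID and retain all other columns.
df2 <- df %>% distinct(SampleID, .keep_all = TRUE)

# Sanity check: after dropping fully identical rows, any SampleID that
# still appears more than once has rows that disagree in Score or Habitat.
conflicts <- df %>%
  distinct() %>%
  count(SampleID) %>%
  filter(n > 1)

print(df2)
print(nrow(conflicts))
```

If `conflicts` is empty, as it is for this sample, the result matches what `unique(df)` would give.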
