I have a large dataset with too many rows for my purpose, and I'm trying to figure out how to make it simple and usable. I have removed all of the columns that I do not need (for now) and removed all of the NA rows.

The data looks like this:

   SampleID   Score   Habitat  
   001-1         0     MCSHRU  
   001-2         1     MCSHRU  
   001-2         1     MCSHRU  
   001-2         1     MCSHRU  
   001-3         0       MCRU  
   001-4         3     MCSHRU  
   001-4         3     MCSHRU

Sample 001-2 has three entries, and they're all the same. This is because the original dataset had a row for each species found in each sample. I'm not interested in the species data and I just want to compare scores for each habitat.

I'd like to have just one row for each SampleID. I could take the mean or minimum for the Score data, but I'm not sure what to do with Habitat data because it is categorical.

How can I clean out the repeated data rows so that there is only one row of data for each SampleID?

It should look like this in the end:

  SampleID   Score   Habitat  
   001-1         0    MCSHRU  
   001-2         1    MCSHRU  
   001-3         0      MCRU  
   001-4         3    MCSHRU
ayesha
    `dplyr::distinct(df)` or, if you have more columns, `dplyr::distinct(df, SampleID, Score, Habitat, .keep_all = TRUE)` – Ronak Shah Aug 04 '21 at 12:25

1 Answer

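Use `unique(df)` from base R or `distinct(df)` from dplyr. Both drop rows that are exact duplicates across all columns, so the repeated rows for each SampleID collapse to one.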
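As a minimal base R sketch, assuming the data frame is called `df` and looks like the sample in the question (the values below are retyped from that sample, and the column types are a guess):

```r
# Sample data retyped from the question; column types are assumed
df <- data.frame(
  SampleID = c("001-1", "001-2", "001-2", "001-2", "001-3", "001-4", "001-4"),
  Score    = c(0, 1, 1, 1, 0, 3, 3),
  Habitat  = c("MCSHRU", "MCSHRU", "MCSHRU", "MCSHRU", "MCRU", "MCSHRU", "MCSHRU"),
  stringsAsFactors = FALSE
)

# Drop rows that are identical across every column
df_unique <- unique(df)

# Alternative: keep only the first row seen for each SampleID,
# even if the other columns differ between repeats
df_first <- df[!duplicated(df$SampleID), ]

print(df_unique)
```

`unique(df)` only helps when the repeated rows match in every column. If other columns can differ between repeats, the `duplicated()` form keeps one row per SampleID regardless.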

dy_by
    Thanks for the lightning fast response! I used `df2 <- df %>% distinct(SampleID, .keep_all = TRUE)` and it worked perfectly. – ayesha Aug 04 '21 at 12:43
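
For reference, here is a sketch of the dplyr call from the comment above, plus one possible way to handle the case raised in the question where Score might differ within a SampleID. The sample data is retyped from the question, and taking the mean of Score together with the first Habitat value is an assumption for illustration, not something the thread settled on:

```r
library(dplyr)

# Sample data retyped from the question; column types are assumed
df <- data.frame(
  SampleID = c("001-1", "001-2", "001-2", "001-2", "001-3", "001-4", "001-4"),
  Score    = c(0, 1, 1, 1, 0, 3, 3),
  Habitat  = c("MCSHRU", "MCSHRU", "MCSHRU", "MCSHRU", "MCRU", "MCSHRU", "MCSHRU"),
  stringsAsFactors = FALSE
)

# Keep the first row for each SampleID and retain all other columns
df2 <- df %>% distinct(SampleID, .keep_all = TRUE)

# If Score could vary within a sample: average it, and carry Habitat along
# by taking its first value (assumes Habitat is constant within a SampleID)
df_summary <- df %>%
  group_by(SampleID) %>%
  summarise(Score = mean(Score), Habitat = first(Habitat))
```

With `.keep_all = TRUE`, `distinct()` keeps the first row it meets for each SampleID, so it is worth checking that repeats really are identical before relying on it.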