I have a large dataset with too many rows for my purpose, and I'm trying to figure out how to simplify it and make it usable. I have removed all of the columns that I do not need (for now) and all of the rows containing NA.
The data looks like this:
SampleID Score Habitat
001-1 0 MCSHRU
001-2 1 MCSHRU
001-2 1 MCSHRU
001-2 1 MCSHRU
001-3 0 MCRU
001-4 3 MCSHRU
001-4 3 MCSHRU
Sample 001-2 has three entries, and they're all the same. This is because the original dataset had a row for each species found in each sample. I'm not interested in the species data and I just want to compare scores for each habitat.
I'd like to have just one row for each SampleID. I could take the mean or minimum of Score within each SampleID (the repeated rows are identical, so either gives the same value), but I'm not sure what to do with Habitat because it is categorical.
How can I clean out the repeated data rows so that there is only one row of data for each SampleID?
It should look like this in the end:
SampleID Score Habitat
001-1 0 MCSHRU
001-2 1 MCSHRU
001-3 0 MCRU
001-4 3 MCSHRU
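
To make the goal concrete, here is a minimal sketch of the logic in plain Python, using only the sample rows above. It assumes that all rows sharing a SampleID are identical, so keeping the first one loses nothing. The function name one_row_per_sample is made up for illustration and is not part of any library.

```python
# Sample rows copied from the question: (SampleID, Score, Habitat)
rows = [
    ("001-1", 0, "MCSHRU"),
    ("001-2", 1, "MCSHRU"),
    ("001-2", 1, "MCSHRU"),
    ("001-2", 1, "MCSHRU"),
    ("001-3", 0, "MCRU"),
    ("001-4", 3, "MCSHRU"),
    ("001-4", 3, "MCSHRU"),
]


def one_row_per_sample(rows):
    """Keep the first row seen for each SampleID, preserving input order.

    Hypothetical helper for illustration; assumes duplicate rows for a
    SampleID are identical, so no aggregation is needed.
    """
    first_seen = {}
    for sample_id, score, habitat in rows:
        # setdefault stores the row only if this SampleID has not been seen yet
        first_seen.setdefault(sample_id, (sample_id, score, habitat))
    return list(first_seen.values())


deduplicated = one_row_per_sample(rows)
for row in deduplicated:
    print(*row)
```

If the rows for one SampleID could ever differ, keeping the first row would silently drop information, and an explicit rule (for example, the minimum Score and the most common Habitat) would be needed instead.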