I have a large dataset from which I want to remove all rows except the first 8 per value of one variable (in this example, only the first one).

Example set:

    Time <- c(1:20)
    stimulus <- c(rep("happy 1",4),rep("happy 2",4),rep("disgust 1",4),rep("anger 1",4),rep("sad 1",4))
    Happy <- c(runif(20,0,1))
    Disgust <- c(runif(20,0,1))
    Anger <- c(runif(20,0,1))
    Subj1<- data.frame(Time,stimulus,Happy,Disgust,Anger)

So: I want to remove all rows except the first row for each value of Subj1$stimulus ("happy 1", "happy 2", "disgust 1", etc.). I managed to do this for a single value by finding its row indices and then dropping all but the first one (all but the first 8 in my real data) with the following code:

Stim1<-which(Subj1$stimulus=="happy 1")
Subj1<- Subj1[-c(Stim1[2:length(Stim1)]),]

However, I want to run this automatically for all stimulus values. What makes this more difficult is that the row indices shift each time rows are removed.

J.Jansen
  • You say you want to remove all rows "except for the first 8..." yet your example removes all except for the first ONE. What do you mean? – Zelazny7 Jun 08 '16 at 17:04
  • Or: http://stackoverflow.com/questions/13279582/select-only-the-first-rows-for-each-unique-value-of-a-column-in-r – Jaap Jun 08 '16 at 17:07

2 Answers

If we need to remove the first row for each 'stimulus', one option is data.table: convert the data frame to a data.table by reference with setDT(Subj1), group by 'stimulus', and drop the first observation of each group with tail:

library(data.table)
setDT(Subj1)[, tail(.SD,-1), by = stimulus]

Or, if we need only the first observation of each group, use head:

setDT(Subj1)[, head(.SD,1), by = stimulus]
#   stimulus Time     Happy     Disgust     Anger
#1:   happy 1    1 0.2721827 0.263906233 0.3218399
#2:   happy 2    5 0.6649942 0.006288805 0.4758943
#3: disgust 1    9 0.4102272 0.275845885 0.6631558
#4:   anger 1   13 0.2924157 0.776806617 0.8609168
#5:     sad 1   17 0.1599896 0.010758160 0.6081846

Or another option is unique from data.table with the by option.

unique(setDT(Subj1), by = "stimulus")
#   Time  stimulus     Happy     Disgust     Anger
#1:    1   happy 1 0.2721827 0.263906233 0.3218399
#2:    5   happy 2 0.6649942 0.006288805 0.4758943
#3:    9 disgust 1 0.4102272 0.275845885 0.6631558
#4:   13   anger 1 0.2924157 0.776806617 0.8609168
#5:   17     sad 1 0.1599896 0.010758160 0.6081846

A dplyr option would be to group by 'stimulus' and get the first observation with slice.

library(dplyr)
Subj1 %>% 
     group_by(stimulus) %>% 
     slice(1)

Or use ave from base R

Subj1[with(Subj1, ave(seq_along(stimulus), stimulus, FUN = seq_along)==1),]
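
The same ave idea extends to the "first 8 per stimulus" case from the question by comparing the within-group position with a threshold instead of testing for equality with 1. The following is only a minimal sketch under some assumptions: the data frame is a reduced copy of the example set, the seed is added purely for reproducibility, and n and pos_in_group are illustrative names that do not appear in the original post.

```r
# Reduced copy of the example data (seed added only for reproducibility)
set.seed(1)
Subj1 <- data.frame(
  Time = 1:20,
  stimulus = rep(c("happy 1", "happy 2", "disgust 1", "anger 1", "sad 1"), each = 4),
  Happy = runif(20),
  stringsAsFactors = FALSE
)

n <- 2  # illustrative value; set to 8 for the real data

# For every row, ave() returns its running position within its stimulus group
pos_in_group <- ave(seq_along(Subj1$stimulus), Subj1$stimulus, FUN = seq_along)

# Keep only rows whose position within the group is at most n
first_n <- Subj1[pos_in_group <= n, ]
```

Because the rows are selected in a single step from the within-group positions, nothing is removed incrementally, so the shifting row indices mentioned in the question should not be an issue. Groups with fewer than n rows are simply kept whole.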
akrun

You can use the base R function duplicated to keep the first instance of a stimulus level:

newdf <- Subj1[!duplicated(Subj1$stimulus), ]

I had to make sure that stimulus was not a factor, using stringsAsFactors = FALSE

data

Subj1<- data.frame(Time,stimulus,Happy,Disgust,Anger, stringsAsFactors = FALSE)

If your data.frame is ordered by stimulus, and you want to keep the first m observations of each, you could use which with duplicated as follows:

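# get rows to include: start at the first row of each stimulus level
# (!duplicated marks first occurrences) and take that row plus the next two
myRows <- c(sapply(which(!duplicated(Subj1$stimulus)), function(i) i:(i+2)))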
# subset
newdf <- Subj1[myRows, ]

The code above will select the first three observations of each stimulus level. One drawback is that it does not check whether each stimulus level actually has that many observations; for a shorter level, the index range runs into the rows of the next level.

However, you can perform this check beforehand using table(Subj1$stimulus).
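
As a rough illustration of that check, the sketch below counts observations per level with table and then selects at most m rows per level in a way that tolerates short groups. It swaps in split and head for the which/duplicated indexing, so it is a different technique from the code above, and it does not require the data to be ordered by stimulus. The small data frame and the names m, counts, short_groups, rows_by_group and keep are made up for this example.

```r
# Hypothetical data in which one stimulus level has fewer than m rows
Subj1 <- data.frame(
  Time = 1:10,
  stimulus = c(rep("happy 1", 4), rep("happy 2", 1), rep("sad 1", 5)),
  stringsAsFactors = FALSE
)

m <- 3  # number of rows to keep per stimulus level

# Check: which levels have fewer than m observations?
counts <- table(Subj1$stimulus)
short_groups <- names(counts)[counts < m]

# Split the row indices by stimulus and take at most m from each group;
# head() returns the whole group when it has fewer than m rows
rows_by_group <- split(seq_len(nrow(Subj1)), Subj1$stimulus)
keep <- sort(unlist(lapply(rows_by_group, head, n = m), use.names = FALSE))

newdf <- Subj1[keep, ]
```

Here short_groups flags the levels that would have caused trouble with fixed-width index ranges, while newdf still contains every available row for them.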

lmo