How to subset a dataframe using the number of rows per group as a condition

Question

I conducted a diary study in which participants had to answer a questionnaire twice a day for 5 days.

My criterion was that people had to answer on at least 3 full days out of the 5, that is, at least 6 of the 10 questionnaires. Every time they filled in the questionnaire they had to enter a personal code, which is how I can see who answered and how many times.

I computed the number of responses per participant like this (Morning_Afternoon_PT_EN is the name of the data frame):

respfreq <- calc.nomiss(Morning_Afternoon_PT_EN$day, tolower(Morning_Afternoon_PT_EN$code), data=Morning_Afternoon_PT_EN)
print(respfreq)

   952345172    alju12    amou79    amou91    baab81 
        0         5        10        10        10        10 
   base85    beju58    cade61    caju21    chno45    crju09 
       10        10        10        10         5         7 
   faap52    fuau48    fude38    fuma07    huju03    leja26 
       10         8         3        10         8        10 
   leju40    lema32    leno81    liab14    liab20    liab50 
       10         9         8         9        10         9 
  liabr14    liag30    liag60   liap520    liau35    lide50 
        1        10         9        10         9         9 
   life10    life74    lija05    lija45    lija78    liju65 
        9         1        10        10         9        10 
   liju94    lima40    lima82    limf96    lioc46    lioc84 
        9        10        10         4        10        10 
   lise50    lise88    maab31    moag91    moap58    pode04 
        9        10        10        10         9         8 
   sade61    saja28    saja79    saoc06    sema72    sema83 
        9        10        10         9        10        10 
   tose37    vima32 
        9         9

length(respfreq)
[1] 56

So I see that "952345172", "chno45", "limf96", "liabr14", "life74" and "fude38" do not meet the requirement, and I want to remove them from the overall data set.

I tried to use subset(), like this:

NewDataFrame<-subset(Morning_Afternoon_PT_EN, respfreq>6)

But I get the following error:

Error: Must subset rows with a valid subscript vector.
i Logical subscripts must match the size of the indexed input.
x Input has size 485 but subscript r has size 56.

I understand the error, but I don't know how to solve it.

Please do not post photos of data or code! If you do, people who are willing to help you would have to type out all that text. Instead provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) P.S. Here is [a good overview on how to ask a good question](https://stackoverflow.com/help/how-to-ask) — dario, Dec 03 '21 at 11:06

Jose · Answer 1 · 2021-12-03T18:23:09.313

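subset() expects a logical condition with one value per row of the data frame (485 here), but respfreq has one value per participant (56). Add the per-code count as a column of the data frame first, then subset on that column: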

x <- c("952345172", "alju12", "amou79", "amou91", "baab81", NA)

code <- rep(x, c(5, 10, 10, 20, 2, 7))

df <- data.frame(id = seq_along(code), code)

head(df)

##   id      code
## 1  1 952345172
## 2  2 952345172
## 3  3 952345172
## 4  4 952345172
## 5  5 952345172
## 6  6    alju12

library(dplyr)

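# count rows per code (missing codes dropped) and attach the count to every row
df2 <- left_join(df, na.omit(df) |> count(code), by = "code")

# keep participants with at least 6 responses
df2 <- subset(df2, n >= 6)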

head(df2)

##    id   code  n
## 6   6 alju12 10
## 7   7 alju12 10
## 8   8 alju12 10
## 9   9 alju12 10
## 10 10 alju12 10
## 11 11 alju12 10
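
The count-and-filter steps can also be written as a single grouped filter. This is a minimal sketch, assuming dplyr is installed and that the threshold is "at least 6" as stated in the question; `min_responses` and `kept` are names introduced here for illustration only.

```r
library(dplyr)

# Toy data mirroring the example above: one row per submitted questionnaire.
x <- c("952345172", "alju12", "amou79", "amou91", "baab81", NA)
df <- data.frame(code = rep(x, c(5, 10, 10, 20, 2, 7)))
df$id <- seq_len(nrow(df))

min_responses <- 6  # hypothetical name for the "at least 6" threshold

kept <- df |>
  filter(!is.na(code)) |>            # rows without a code cannot be attributed
  group_by(code) |>
  filter(n() >= min_responses) |>    # n() is the number of rows in the current group
  ungroup()

# alju12 and amou79 have 10 rows each, amou91 has 20
table(kept$code)
```

Because the filter is evaluated within each group, no helper column or join is needed. Rows with a missing code are dropped explicitly, since `group_by()` would otherwise treat `NA` as a group of its own.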

Another option is to use table() and %in% in base R:

# number of rows per code
tabc <- table(df$code)

# keep rows whose code appears at least 6 times
df3 <- df[df$code %in% names(tabc[tabc >= 6]), ]
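
For completeness, here is a base R sketch that builds the per-row count with `ave()`, so that `subset()` receives a condition of the right length. It is only an illustration on made-up data: the data frame `diary`, its contents and the column `n` are assumptions, and `tolower()` is applied because the question lower-cases the codes before counting.

```r
# Toy stand-in for the diary data; the codes are borrowed from the question,
# the number of rows per code is invented for the example.
diary <- data.frame(
  id   = 1:19,
  code = c(rep("Alju12", 4), rep("alju12", 3), rep("chno45", 5), rep("crju09", 7))
)

# Normalise case first, as the question does with tolower().
diary$code <- tolower(diary$code)

# ave() applies length() within each code and returns one value per row,
# so the result has nrow(diary) elements rather than one per participant.
diary$n <- ave(seq_along(diary$code), diary$code, FUN = length)

# The condition now has the same length as the data frame.
length(diary$n) == nrow(diary)

# Keep participants with at least 6 responses.
kept <- subset(diary, n >= 6)
unique(kept$code)
```

Here "alju12" reaches 7 rows only after its two spellings are merged, so it is kept together with "crju09", while "chno45" (5 rows) is dropped.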
