
I started using R a few days ago and ran into the following problem during a data analysis:

I have a data frame with several rows and columns, and I am interested in column A. Some rows share the same value in column A. If 10 or more rows share a value, I want to keep those rows. I don't want to use the remaining rows in further analysis.

What I wrote so far:

subset(table(data$A),table(data$A)>=10, drop=FALSE)

Problem: it doesn't really work. The rows I removed reappear when I aggregate and group the data at the end, and the other columns get dropped somehow.

Sorry for the non-technical wording.

Any ideas?

Jilber Urbina
Vinzenz

2 Answers

Let's create a data.frame:

df1 <- data.frame(A=c(rep(1, 10), rep(2,5), rep(3,12), rep(4,6)),
                  B = rnorm(33),
                  C = rnorm(33, mean=100))

Now you can split the data by A and keep only the groups that contain 10 or more rows:

> tmp <- lapply(split(df1, df1$A), function(x) x[length(x$A)>=10, ])
> do.call(rbind, tmp)
     A            B         C
1.1  1  1.847173929 101.44195
1.2  1  0.140540889  98.84883
1.3  1 -0.588164254 100.89362
1.4  1  1.325389063  99.70454
1.5  1  1.168492910  99.31399
1.6  1  0.394623296 100.82031
1.7  1 -1.652867096 101.47617
1.8  1 -0.005714566 100.81326
1.9  1 -1.248685987  98.59261
1.10 1 -0.774900426 102.11714
3.16 3  0.475175282  99.00934
3.17 3  1.141757827 101.04925
3.18 3 -0.144273962  99.58414
3.19 3  0.621142217  98.72315
3.20 3  0.768943017  99.42351
3.21 3 -1.906744188  99.08345
3.22 3  0.388444691 100.07014
3.23 3 -0.845029096 101.66754
3.24 3  0.396626635  99.52390
3.25 3  0.597764453  99.76741
3.26 3 -0.794314145  99.90497
3.27 3  0.347058621 100.17985
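
The `table()` call from the question already holds the per-value counts; what is missing is using them to index the original data frame rather than the table itself. A minimal sketch of that idea, assuming the same `df1` as above (the names `counts`, `keep_values` and `filtered` are only illustrative):

```r
set.seed(1)  # only so the random columns are reproducible
df1 <- data.frame(A = c(rep(1, 10), rep(2, 5), rep(3, 12), rep(4, 6)),
                  B = rnorm(33),
                  C = rnorm(33, mean = 100))

# Number of rows for each distinct value of A
counts <- table(df1$A)

# Values of A that occur at least 10 times (returned as character strings)
keep_values <- names(counts)[counts >= 10]

# Index the original data frame, so columns B and C are kept;
# %in% coerces the numeric A to character for the comparison
filtered <- df1[df1$A %in% keep_values, ]
```

If A is a factor, `droplevels(filtered)` drops the now-empty levels, which otherwise still show up in `table()` or `tapply()` output.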
Jilber Urbina
  • split apply combine will mess with the original row sorting (unless groups are contiguous). I guess there's an `ave` analogue, though. – Frank May 09 '18 at 16:34
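
The `ave` analogue mentioned in the comment might look roughly like the sketch below. It assumes the same `df1` layout as in the answer; the rows are shuffled here only to show that the original row order survives, and `group_size` and `kept` are illustrative names.

```r
set.seed(1)  # only so the random columns and the shuffle are reproducible
df1 <- data.frame(A = c(rep(1, 10), rep(2, 5), rep(3, 12), rep(4, 6)),
                  B = rnorm(33),
                  C = rnorm(33, mean = 100))

# Shuffle the rows so the groups are no longer contiguous
df1 <- df1[sample(nrow(df1)), ]

# ave() returns, for every row, the size of the group that row belongs to
group_size <- ave(seq_along(df1$A), df1$A, FUN = length)

# Logical indexing keeps the remaining rows in their current order
kept <- df1[group_size >= 10, ]
```

Because nothing is split and recombined, the row names and row order of `kept` are simply those of `df1` with the small groups left out.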

A tidyverse solution:

library(dplyr)

df1 <- data.frame(A = c(rep(1, 10), rep(2, 5), rep(3, 12), rep(4, 6)),
                  B = rnorm(33),
                  C = rnorm(33, mean = 100))


df1 %>%
    group_by(A) %>%
    add_tally() %>%
    filter(n >= 10)

We take the data, group it by the values in A, add a column n that tallies how many rows are in each group, and then keep only the rows whose group has 10 or more rows. Note that the result is still grouped by A and carries the extra column n.
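
If the extra count column is not wanted, a possible variant is to call `n()` directly inside `filter()`, which evaluates the group size without storing it. This is a sketch under the same assumptions as above (same `df1`; `res` is just an illustrative name):

```r
library(dplyr)

set.seed(1)  # only so the random columns are reproducible
df1 <- data.frame(A = c(rep(1, 10), rep(2, 5), rep(3, 12), rep(4, 6)),
                  B = rnorm(33),
                  C = rnorm(33, mean = 100))

res <- df1 %>%
    group_by(A) %>%
    filter(n() >= 10) %>%  # n() is the number of rows in the current group
    ungroup()              # drop the grouping so later steps see a plain tibble
```

Here `res` contains only the columns A, B and C, and `ungroup()` avoids surprises if later summaries are not meant to be computed per group.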

divibisan