R: deleting rows for factor levels with a given frequency of occurrence and automatically updating the factor levels

Question

I have a data set containing several variables. One of the variables is a factor that holds the group labels (A, B, C, etc.). The remaining variables are numeric.

 > data1
   Group Value
1      A    23
2      A    25
3      B     1
4      C    15
5      C    11
6      C    14
7      B     3
8      B     4
9      B     2
10     C    19

For further statistical calculations I want to exclude from the data set all rows that belong to a particular group (e.g., X), provided that the group occurs in the data frame only a given number of times (e.g., no more than 2 times).

The material I have seen so far mainly concerns deleting rows with specific values and does not deal with how often a group (factor level) occurs in the data frame. Maybe I'm wrong? Sorry!

To remove specific rows manually, I use the following code:

data1 <- as.data.frame(
  lapply(subset(data1, !Group=="A"),
         function(x) if(is.factor(x)) factor(x) else x
  )
)

I would like to automate this process and exclude all factor levels (groups) with a predetermined frequency of occurrence:

> data1
  Group Value
1     B     1
2     C    15
3     C    11
4     C    14
5     B     3
6     B     4
7     B     2
8     C    19

Addition

akrun suggested the following code:

tbl <- table(data1$Group)
data1 <- subset(data1, Group %in% names(tbl)[tbl>2])

This is exactly what I need, and I thank him for it! However, the factor levels in the result remain unchanged. To correct this, I have to add the line:

data1$Group = factor(data1$Group)

Surely there is a ready-made solution that also handles this case?

akrun · Accepted Answer · 2016-08-18T02:26:11.310

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data1)), group by 'Group', and if the number of rows in a group is greater than 2 (.N > 2), return the Subset of Data.table (.SD) for that group.

library(data.table)
setDT(data1)[, if(.N >2) .SD, by = Group]

Or with dplyr: after grouping by 'Group', filter to keep the groups whose number of rows (n()) is greater than 2.

library(dplyr)
data1 %>%
      group_by(Group) %>%
      filter(n() > 2)

Or using base R: get the frequency of each 'Group' with table, then use %in% inside subset to keep the groups that occur more than 2 times.

tbl <- table(data1$Group)
subset(data1, Group %in% names(tbl)[tbl>2])
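As for the unused factor levels mentioned in the question, base R's droplevels() is one ready-made option: applied to a data frame, it drops unused levels from every factor column. The following is only a minimal sketch that wraps the table approach above in a function. The helper name keep_frequent_groups and its arguments are hypothetical, chosen for illustration, and the data frame is rebuilt from the question's example.

```r
# Rebuild the example data from the question (Group stored as a factor).
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# Hypothetical helper: keep only the groups that occur more than `min_n`
# times, then drop the factor levels that no longer appear in the data.
keep_frequent_groups <- function(df, group_col, min_n) {
  tbl <- table(df[[group_col]])
  kept <- df[df[[group_col]] %in% names(tbl)[tbl > min_n], , drop = FALSE]
  droplevels(kept)
}

result <- keep_frequent_groups(data1, "Group", 2)
levels(result$Group)  # "B" "C"
```

The same droplevels() call should also work on the output of the data.table and dplyr variants, since both return objects that inherit from data.frame.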

Ah, just beat me to it. Nice one – Simon Jackson, Aug 18 '16 at 02:23
