I have a data set containing several variables. One of the variables, a factor, contains the group labels (A, B, C, etc.). The remaining variables are numeric.

 > data1
   Group Value
1      A    23
2      A    25
3      B     1
4      C    15
5      C    11
6      C    14
7      B     3
8      B     4
9      B     2
10     C    19
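
For anyone who wants to reproduce the example, the data frame above could be rebuilt roughly as follows. This is only a sketch: it assumes `Group` is stored as a factor, as described, and the name `counts` is introduced here for illustration. `table()` shows how often each group occurs.

```r
# Rebuild the example data; Group is created as a factor explicitly,
# because data.frame() no longer converts strings to factors by default.
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# Frequency of each group: A occurs 2 times, B and C occur 4 times each.
counts <- table(data1$Group)
counts
```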

For further statistical calculations I want to exclude from the data set the rows that belong to a particular group (e.g., X), provided that the group occurs in the data frame only a given number of times (e.g., 2 times or fewer, like group A above).

The materials that I've seen so far mainly concern deleting rows with specific values and do not deal with the frequency of occurrence of a group (factor level) in the data frame. Maybe I'm wrong? Sorry!

To remove specific rows in the "manual" mode, I use the following code:

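# Drop the rows of group "A", then re-create every factor column
# so that levels which no longer occur are removed.
data1 <- as.data.frame(
  lapply(subset(data1, Group != "A"),
         function(x) if (is.factor(x)) factor(x) else x
  )
)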

I would like to automate this process and exclude all factor levels (groups) that occur no more than a predetermined number of times. The desired result:

> data1
  Group Value
1     B     1
2     C    15
3     C    11
4     C    14
5     B     3
6     B     4
7     B     2
8     C    19

Addition

akrun suggested the following code:

tbl <- table(data1$Group)
data1 <- subset(data1, Group %in% names(tbl)[tbl>2])

This is what I need, and I thank him for it! However, in the result the factor levels remain unchanged. To correct this, I have to add the following line:

data1$Group = factor(data1$Group)

Surely there is a ready-made solution that handles this case?

Denis Efimov

1 Answer

We can use data.table: convert the 'data.frame' to a 'data.table' with setDT(data1), group by 'Group', and, if the number of rows in a group is greater than 2 (.N > 2), return the Subset of Data.table (.SD).

library(data.table)
setDT(data1)[, if(.N >2) .SD, by = Group]

Or with dplyr: after grouping by 'Group', filter to keep only the groups whose number of rows (n()) is greater than 2.

library(dplyr)
data1 %>%
      group_by(Group) %>%
      filter(n() > 2)

Or using base R: get the frequency of each 'Group' with table, then use %in% inside subset to keep only the groups that occur more than 2 times.

tbl <- table(data1$Group)
subset(data1, Group %in% names(tbl)[tbl>2])
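
Regarding the unused factor levels mentioned in the question: one possible way to handle them in base R is `droplevels()`, which removes levels that no longer occur in a factor or in the factor columns of a data frame. The sketch below wraps the `table()` approach in a small helper. The helper name `keep_frequent_groups` and its arguments are made up for illustration, and `Group` is assumed to be a factor.

```r
# Hypothetical helper (not from any package): keep only the rows whose group
# occurs more than `threshold` times, then drop factor levels left unused.
keep_frequent_groups <- function(df, group_col, threshold) {
  tbl <- table(df[[group_col]])
  keep <- df[[group_col]] %in% names(tbl)[tbl > threshold]
  droplevels(df[keep, , drop = FALSE])
}

# Example data from the question, with Group created as a factor.
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

result <- keep_frequent_groups(data1, "Group", 2)
levels(result$Group)  # "B" "C"
```

The original row names are kept in `result`; if consecutive numbering is wanted, `rownames(result) <- NULL` resets them.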
akrun