0

I have a data set similar to this one:

x <- sample(c("A", "B", "C", "D"), 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
y <- sample(1:40, 1000, replace=TRUE)
d <- data.frame(x,y)

str(d)
'data.frame':   1000 obs. of  2 variables:
$ x: Factor w/4 levels "A","B","C","D": 1 3 3 2 3 3 3 3 4 3 ...
$ y: int  28 35 14 4 34 36 30 35 26 9 ...

table(d$x)
A   B   C   D 
115 204 637 44 

So in my real data set i have multiple thousands of these category (A, B, C, D).

The str() of my real dataset

str(realdata)
data.frame':    346340 obs. of  91 variables:
$ author       : Factor w/ 42590 levels "-jon-","--LZR--",..: 1962 3434 1241 7666 6235 2391 1196 2779 1881 339 ...
$ created_utc  : Factor w/ 343708 levels "2015-05-01 02:00:41",..: 14815 23163 2281 3569 5922 7211 15783 5512 13485 8591 ...
$ group    : Factor w/ 5 levels "xyz","abc","bnm",..: 2 2 2 2 2 2 2 2 2 2 ...
....

Now i want to subset the data, so i have only the rows of those $authors (or $x in the d dataframe) in my new dataframe that have more than 100 entries in total.

I tried the following:

dnew <- subset(realdata, table(realdata$author) > 100)

It gives me a result, but it seems the not all entries of the authors were included. Although it should be way more, i just get 1.3% of the rows of the complete dataset. I checked it manually (with excel) and it should be way more than that (approx. 30%). The manual analysis showed that 1.2 % of $author stand for 30% of the entries. So it seems he just gave me one row with the $author who has more than 100 entries, but not all of his entries.

Do you know of a way to fix this?

Sowmya S. Manian
  • 3,723
  • 3
  • 18
  • 30
Arthur Pennt
  • 155
  • 1
  • 14

2 Answers2

3

We can do this easily with data.table. Convert the 'data.frame' to 'data.table' (setDT(d), grouped by 'x', if the number of observations is greater than 100, we Subset the Data.table (.SD)

library(data.table)
ddt <- setDT(d)[, if(.N > 100) .SD, x]

Or if we are using dplyr, the same approach can be used.

library(dplyr)
dpl <- d %>%
        group_by(x) %>%
        filter(n() > 100) %>%
        droplevels()
str(dpl)
#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 866 obs. of  2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
# $ y: int  25 25 13 11 2 32 12 15 12 3 ...

Also, in using the base R, the table can be helpful

v1 <- table(d$x)
d1 <- subset(d, x %in% names(v1)[v1 > 100])

As the column 'x' is factor, when we subset the dataset, the levels persist, to remove that use droplevels

d2 <- droplevels(d1)

As the OP didn't set the seed, the output will be different for each person.

str(d2)
#'data.frame':   866 obs. of  2 variables:
#$ x: Factor w/ 2 levels "B","C": 1 1 2 1 1 2 2 2 2 2 ...
#$ y: int  25 25 13 11 2 32 12 15 12 3 ...
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I tried the version with dplyr. It works with the test dataframe. Alas I get the error message, when applying it to my real data set: Error: invalid subscript type 'integer'. I checked, the $author is a factor, as is x. I used the following command auth <- realdata %>% group_by(realdata$author) %>% filter(n() > 100) – Arthur Pennt Aug 20 '16 at 08:31
  • @8bytez You are not even using the syntax in my code `group_by(realdata$author)` is not used by me. – akrun Aug 20 '16 at 12:05
  • hey @akrun. sorry, i tried the rest of your code after trying the other version. It probably works as well. Thank you for helping me! As i said, your code worked perfectly fine with the test sample. Alas, when i changed the variable numbers to that of my real data set, it gave me this error message. Thats why the syntax code which is not from you. It is my real data variables. – Arthur Pennt Aug 20 '16 at 15:06
  • @8bytez No problem. I guess it has to do with the version of `dplyr` I am using 0.5.0 – akrun Aug 20 '16 at 15:07
1

I. Data frame d with four levels

  table(d$x)

  #  A   B   C   D 
  # 92 232 630  46 

II. Checking which level has greater than 100 records

  which(table(d$x)>100)
  # B C 
  # 2 3 

III. Subsetting d data frame having only records belonging to levels which have greater than 100 records ie. level B and level C

  result <- d[ d$x %in%  names(table(d$x))[table(d$x) > 100] , ]
  dim(result)
  # [1] 862   2

  str(result)
  # 'data.frame':   862 obs. of  2 variables:
  #  $ x: Factor w/ 4 levels "A","B","C","D": 3 2 3 3 2 2 2 3 3 3 ...
  #  $ y: int  29 32 27 40 30 38 8 16 2 23 ...

Level A and D still persists with 0 records

  table(result$x)

  #   A   B   C   D 
  #   0 232 630   0 

IV. Removing the levels with 0 records using factor()

  result$x <- factor(result$x)

  str(result)
  # 'data.frame':   860 obs. of  2 variables:
  #  $ x: Factor w/ 2 levels "B","C": 2 2 1 2 1 2 2 2 1 2 ...
  #  $ y: int  29 32 27 40 30 38 8 16 2 23 ...

  table(result$x)
  #   B   C 
  # 232 630 
Sowmya S. Manian
  • 3,723
  • 3
  • 18
  • 30
  • 1
    @akrun I have no clue you have added that, for my answer I used combination of two links, I tried it and then posted, here it is http://stackoverflow.com/questions/24835233/subset-based-on-frequency-level http://stackoverflow.com/questions/1195826/drop-factor-levels-in-a-subsetted-data-frame Your answer for base R also works I checked that too, the levels were still persisting in data, so I dropped them following the second link. Thank you. – Sowmya S. Manian Aug 20 '16 at 13:22