Subsetting a factor on amount of observations in R

Question

I have this dataset:

str(pcol)
'data.frame':   3130486 obs. of  20 variables:
$ body     : Factor w/ 1623190 levels "","' i need to... '",..: 76837  ...
$ author   : Factor w/ 18164 levels "--Kai--","--sunshine--",..: 11455 6643 8117 832 ...
$ ups      : int  2 7 1 1 1 1 2 4 2 1 ...
....

Making a table shows the following:

table(pcol$author):
AuthornameX   AuthornameY   AuthornameZ ...
148           87            102

 'table' int [1:18164(1d)] 129 5 152 67 18 25 58 319 44 204 ...
- attr(*, "dimnames")=List of 1
..$ : chr [1:18164] "--Kai--" "--sunshine--" "-0---0-" "-73-" ...

Now I want to create a new dataset containing only the authors who appear in the dataset more than 100 times.

I tried the following:

x <- subset(pcol, length(pcol$author) > 100 )
'table' int [1:2634(1d)] 129 152 319 204 157 177 198 106 144 437 ...
 attr(*, "dimnames")=List of 1
..$ : chr [1:2634] "--Kai--" "-0---0-" "-Lolrax-" "-PTM-" ...

This limits the table to the authors with counts over 100. But now I have the problem of how to extract the rows for these authors from the original dataset.

I tried this:

> y <- subset(pcol, pcol$authors == x)

But that leaves me with a blank dataframe with 0 observations.

So: how do I reduce the original dataset to a new one containing only the authors who appear more than 100 times?

My question is similar to this one, so it is potentially a duplicate. Although that question was answered, I was not able to transfer the solution there to my problem, which is why I am asking here.

Here is a 10,000-row sample of my data set.

Aggregate and add new column that shows the count per Author, then use subset on that column. Also add [reproducible example](http://stackoverflow.com/questions/5963269). — zx8754, Jul 25 '16 at 08:03
or with `library(dplyr) ; pcol %>% group_by(author) %>% filter(n() > 100)` — Sotos, Jul 25 '16 at 08:08
Possible duplicate of [Count rows for selected column values and remove rows based on count in R](http://stackoverflow.com/questions/19412337/count-rows-for-selected-column-values-and-remove-rows-based-on-count-in-r) — ArunK, Jul 25 '16 at 08:59
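
The count-column approach suggested in the first comment could be sketched in base R roughly as follows. This is a minimal illustration on made-up data: the data frame `toy`, the column name `n_posts`, and the threshold of 2 (instead of 100) are assumptions chosen for brevity, not part of the original dataset.

```r
# Toy stand-in for pcol (hypothetical data, not the asker's dataset)
toy <- data.frame(
  author = factor(c("a", "b", "a", "c", "a", "b", "c", "c")),
  ups    = c(2, 7, 1, 1, 1, 1, 2, 4)
)

# Add a column holding the number of rows per author.
# ave() applies FUN within each group and returns a vector aligned with the rows.
toy$n_posts <- ave(seq_along(toy$author), toy$author, FUN = length)

# Keep only rows whose author appears more than 2 times (use 100 for pcol)
frequent <- subset(toy, n_posts > 2)
```

The extra column can be removed afterwards with `frequent$n_posts <- NULL` if it is not needed.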

score 2 · Accepted Answer · answered Jul 25 '16 at 09:20

Using the data.table package:

library(data.table)
setDT(pcol)  # convert pcol to a data.table by reference

Find the authors with more than 100 occurrences, then keep only their rows:

author_sel <- pcol[, .N, by = .(author)][N > 100]
pcol[author %in% author_sel$author]
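
As a hedged illustration of the same idea on made-up data (the table `toy` and the threshold of 2 are assumptions chosen for brevity), the two steps above can be run end to end, and the count and filter can also be combined into a single grouped expression:

```r
library(data.table)

# Toy stand-in for pcol (hypothetical data, not the asker's dataset)
toy <- data.table(
  author = factor(c("a", "b", "a", "c", "a", "b", "c", "c")),
  ups    = c(2, 7, 1, 1, 1, 1, 2, 4)
)

# Two-step version, as above: count rows per author, then filter
author_sel <- toy[, .N, by = .(author)][N > 2]
two_step   <- toy[author %in% author_sel$author]

# One-step variant: return a group's rows (.SD) only if the group is large enough
one_step <- toy[, if (.N > 2) .SD, by = author]
```

Note that the one-step variant returns the rows grouped by author rather than in their original order.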

streof

score 1 · Answer 2 · shayaa · 2016-07-25T09:17:02.800

A base solution could be

subset(pcol, author %in% names(which(table(pcol$author)>100)))

Perhaps you should consider learning dplyr. The dplyr solution given in the comments on the question is arguably easier to read and may run faster on a dataset of this size.
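
One detail worth checking with the base solution above: because `author` is a factor, the filtered data frame still carries all of the original levels, so `table()` on the result keeps listing the removed authors with a count of 0. A minimal sketch on made-up data (the data frame `toy` and the threshold of 2 are assumptions for illustration) shows how `droplevels()` cleans this up:

```r
# Toy stand-in for pcol (hypothetical data, not the asker's dataset)
toy <- data.frame(
  author = factor(c("a", "b", "a", "c", "a", "b", "c", "c")),
  ups    = c(2, 7, 1, 1, 1, 1, 2, 4)
)

# Names of authors with more than 2 rows (use 100 for pcol)
keep <- names(which(table(toy$author) > 2))

# Rows belonging to those authors
frequent <- subset(toy, author %in% keep)

# The unused level "b" is still listed here, with a count of 0
table(frequent$author)

# Drop unused factor levels so only the remaining authors are listed
frequent <- droplevels(frequent)
table(frequent$author)
```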

edited Jul 25 '16 at 09:17

answered Jul 25 '16 at 08:09

I added a sample data set. I tried your command lines, but alas there were plenty of authors with fewer than 100 appearances. Maybe the sample dataset gives you a better understanding of what I want to know – Arthur Pennt Jul 25 '16 at 08:57
Did this fix your issue? – shayaa Jul 25 '16 at 19:39
Alas it did not. I used stefan8888's proposal. But anyway, thank you for your help! – Arthur Pennt Jul 26 '16 at 07:49
I fixed it right when you made your first comment. nbd, @stefan8888 had a good solution. – shayaa Jul 26 '16 at 07:53
Ah ok. I did not notice that. I just tried your command, and it worked as well. Thank you! – Arthur Pennt Jul 26 '16 at 08:01
