I have this dataset:

str(pcol)
'data.frame':   3130486 obs. of  20 variables:
$ body     : Factor w/ 1623190 levels "","' i need to... '",..: 76837  ...
$ author   : Factor w/ 18164 levels "--Kai--","--sunshine--",..: 11455 6643 8117 832 ...
$ ups      : int  2 7 1 1 1 1 2 4 2 1 ...
....

Making a table shows the following:

table(pcol$author):
AuthornameX   AuthornameY   AuthornameZ ...
148           87            102

 'table' int [1:18164(1d)] 129 5 152 67 18 25 58 319 44 204 ...
- attr(*, "dimnames")=List of 1
..$ : chr [1:18164] "--Kai--" "--sunshine--" "-0---0-" "-73-" ...

Now I want to create a new dataset containing only the authors who appear in the dataset more than 100 times.

I tried the following:

x <- subset(pcol, length(pcol$author) > 100 )
'table' int [1:2634(1d)] 129 152 319 204 157 177 198 106 144 437 ...
 attr(*, "dimnames")=List of 1
..$ : chr [1:2634] "--Kai--" "-0---0-" "-Lolrax-" "-PTM-" ...

This way I limited the table to the authors whose counts are over 100. But now I have the problem of how to extract the rows for these authors from the original dataset.

I tried this:

> y <- subset(pcol, pcol$authors == x)

But that leaves me with a blank dataframe with 0 observations.

So: how do I turn the original dataset into a new one that contains only the authors who appear more than 100 times?

My question is similar to this one, so it is potentially a duplicate. Although that question was answered, I was not able to transfer the solution to my problem, which is why I am asking.

Here is a 10,000-row sample of my dataset.

Arthur Pennt
  • Aggregate and add new column that shows the count per Author, then use subset on that column. Also add [reproducible example](http://stackoverflow.com/questions/5963269). – zx8754 Jul 25 '16 at 08:03
  • Try `y <- subset(pcol, author %in% names(x))`. – Alex Jul 25 '16 at 08:06
  • Or with `library(dplyr); pcol %>% group_by(author) %>% filter(n() > 100)` – Sotos Jul 25 '16 at 08:08
  • Possible duplicate of [Count rows for selected column values and remove rows based on count in R](http://stackoverflow.com/questions/19412337/count-rows-for-selected-column-values-and-remove-rows-based-on-count-in-r) – ArunK Jul 25 '16 at 08:59

2 Answers

Using the data.table package one gets

require(data.table)
setDT(pcol)

Find the authors with more than 100 occurrences

author_sel <- pcol[, .N, by = .(author)][N > 100]
pcol[author %in% author_sel$author]
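
To see what these two lines do, here is a minimal sketch on a toy table. The `toy` data, the `min_n` threshold (standing in for 100), and the result names are assumptions made up for illustration, not part of the asker's data. The last line shows a one-step variant that returns a group's rows only when the group is large enough; treat it as an alternative idiom rather than the method of this answer.

```r
library(data.table)

# Hypothetical stand-in for pcol: three authors with 3, 1 and 2 rows
toy <- data.table(
  author = c("a", "a", "a", "b", "c", "c"),
  ups    = c(2L, 7L, 1L, 1L, 4L, 2L)
)
min_n <- 2  # plays the role of 100 in the question

# Two-step form, mirroring the answer: count rows per author, then filter
author_sel   <- toy[, .N, by = .(author)][N > min_n]
res_two_step <- toy[author %in% author_sel$author]

# One-step form: .SD (the group's rows) is returned only for large groups
res_one_step <- toy[, if (.N > min_n) .SD, by = author]
```

Note that `setDT()` converts the data frame to a data.table by reference, so `pcol` itself is changed in place; assign the filtered result to a new name if you want to keep both.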
streof

A base solution could be

subset(pcol, author %in% names(which(table(pcol$author)>100)))
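
The one-liner can be unpacked step by step. The sketch below uses a small made-up data frame (`toy`) and a hypothetical threshold `min_n` in place of 100. It also shows a variant that computes a per-row count with `ave()`, in the spirit of the "add a count column, then subset" suggestion in the comments.

```r
# Hypothetical stand-in for pcol: three authors with 3, 1 and 2 rows
toy <- data.frame(
  author = factor(c("a", "a", "a", "b", "c", "c")),
  ups    = c(2L, 7L, 1L, 1L, 4L, 2L)
)
min_n <- 2  # plays the role of 100 in the question

counts <- table(toy$author)              # number of rows per author
keep   <- names(counts[counts > min_n])  # author names above the threshold
res    <- subset(toy, author %in% keep)  # rows belonging to those authors

# Subsetting keeps the unused factor levels; droplevels() removes them
res <- droplevels(res)

# Variant: per-row group size via ave(), then plain logical indexing
n_per_row <- ave(seq_along(toy$author), toy$author, FUN = length)
res_ave   <- toy[n_per_row > min_n, ]
```

Comparing against `names(...)` is the key point: the table holds counts as its values and the author names as its names, so matching `author` against the table itself compares names with numbers.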

Perhaps you should consider learning dplyr. The dplyr solution is easier to read and faster to run on your computer.
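
For reference, a dplyr version along the lines of the one posted in the comments could look like the sketch below. The `toy` data and `min_n` threshold are made up for illustration, and whether this runs faster than the base version on the real data is not something this example measures.

```r
library(dplyr)

# Hypothetical stand-in for pcol: three authors with 3, 1 and 2 rows
toy <- data.frame(
  author = c("a", "a", "a", "b", "c", "c"),
  ups    = c(2L, 7L, 1L, 1L, 4L, 2L)
)
min_n <- 2  # plays the role of 100 in the question

res <- toy %>%
  group_by(author) %>%      # one group per author
  filter(n() > min_n) %>%   # keep groups with more than min_n rows
  ungroup()                 # drop the grouping from the result
```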

shayaa