3

I need to change the value of elements in a vector, but only for those elements whose value occurs fewer than n times.

I used this method, with Data$GENE being the vector to be changed.

Data$GENE[which(Data$GENE %in% names(table(Data$GENE)[table(Data$GENE) < 10]))] <<- 'other'

It's a bit convoluted. Is there a more succinct way?

UPDATE: in response to the comments below, here is a simple reproducible example:

> vec <- c(rep('foo', 5), rep('foo1', 2), rep('foo2', 1), rep('foo3', 3), rep('bar', 6))
> table(vec)
vec
 bar  foo foo1 foo2 foo3 
   6    5    2    1    3 
> vec[which(vec %in% names(table(vec)[table(vec) < 5]))] <- 'other'
> table(vec)
vec
  bar   foo other 
    6     5     6
Bakaburg
  • 2
    Can you make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output? It's easier to help improve code when we can actually run it. – MrFlick Dec 22 '14 at 08:13
  • 2
    I would stick with `table` or `summary` solutions as `ave` is just doing an unnecessary loop – David Arenburg Dec 22 '14 at 08:46

4 Answers

5

The summary method for factors has support for this:

summary(factor(vec),maxsum=sum(table(vec)>=5)+1)
    bar     foo (Other) 
      6       5       6 
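Note that the summary() call returns only the lumped counts. If the recoded vector itself is needed, one possible sketch (not part of the answer above) is to merge the rare factor levels; it relies on the base R behaviour that assigning the same label to several levels merges them. The threshold of 5 is taken from the question's example.

```r
vec <- c(rep("foo", 5), rep("foo1", 2), rep("foo2", 1), rep("foo3", 3), rep("bar", 6))

f <- factor(vec)
counts <- table(f)                        # counts are in the same order as levels(f)
rare <- as.vector(counts < 5)             # TRUE for levels seen fewer than 5 times
levels(f)[rare] <- "other"                # duplicated labels merge into one level
table(f)
```

The result stays a factor; wrap it in as.character() if a character vector is preferred.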
James
3

I would just do it in 2 steps, so it's less convoluted, as you say, and you only need to compute the table once. Also, you don't need which, which your approach uses.

y <- table(vec)
vec[vec %in% names(y[y < 5])] <- "other"
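If this is needed in more than one place, the two steps can be wrapped in a small function. This is only a sketch: the name lump_rare and its arguments are made up for illustration, and it assumes x is a character vector (a factor would first need "other" added to its levels).

```r
# Hypothetical helper: replace every value that occurs fewer than `n` times
# with `label`. Assumes `x` is a character vector.
lump_rare <- function(x, n, label = "other") {
  counts <- table(x)                  # compute the frequency table once
  rare <- names(counts)[counts < n]   # values below the threshold
  x[x %in% rare] <- label
  x
}

vec <- c(rep("foo", 5), rep("foo1", 2), rep("foo2", 1), rep("foo3", 3), rep("bar", 6))
result <- lump_rare(vec, 5)
table(result)
```

Because the function returns a modified copy, the original vector is left untouched and no `<<-` is required.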
talat
2

You can do this easily with data.table.

library(data.table)
data(mtcars)
setDT(mtcars, keep.rownames = TRUE)  # convert the data.frame to a data.table by reference

# add a count column with .N, then chain with [count < ...]
mtcars[, count := .N, by = cyl][count < 14]
talat
Henk
  • OK, now I feel confused :) I didn't know this syntax at all! What is data.table? How can you use 3 indexes inside the square brackets? What does := mean? – Bakaburg Dec 22 '14 at 08:25
  • 1
    That doesn't answer the question though. This is just doing subsetting. There is no values renaming here. – David Arenburg Dec 22 '14 at 08:43
  • A possible modification would be `table(as.data.table(vec)[, count := .N, by = vec][count < 5, vec := "other"]$vec)` though it seems quite awkward here – David Arenburg Dec 22 '14 at 09:09
  • Assign by reference. Read [this](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf) – David Arenburg Dec 22 '14 at 10:01
2

I think what you're describing can be accomplished with ave in base R. Here we replace the values that occur fewer than five times.

vec[ave(seq_along(vec), vec, FUN=length) < 5] <- "other"
vec

We can wrap this in a friendly function

haslessthan <- function(x, n) ave(seq_along(x), x, FUN=length) < n
vec[haslessthan(vec, 5)] <- "other"

Either way the result is

vec
  bar   foo other 
    6     5     6 
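As a sanity check, the ave() approach can be compared with a single table() lookup. The sketch below uses the example vector from the question and threshold 5; the names by_ave and by_table are purely illustrative.

```r
vec <- c(rep("foo", 5), rep("foo1", 2), rep("foo2", 1), rep("foo3", 3), rep("bar", 6))
n <- 5

# Per-element group size via ave(), as in the code above
by_ave <- ave(seq_along(vec), vec, FUN = length) < n

# Per-element group size via one table() lookup by name
counts <- table(vec)
by_table <- as.vector(counts[vec]) < n

identical(by_ave, by_table)
```

Both logical vectors flag the same elements, so the choice between them is mostly a matter of readability.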
MrFlick