Hello, I am using RStudio to filter out varieties of wine that appear fewer than 5000 times in a dataset.

I ran the code below:

library(data.table)
# create a new data table containing only varieties with more than 5000 rows
wineVar <- setDT(wineNew)[, if(.N > 5000) .SD, by = variety]
# list the unique varieties to show there are only 5
unique(wineVar$variety)

However, when I check how many levels there are, the output still lists all 632 levels:

[1] Cabernet Sauvignon       Pinot Noir               Chardonnay              
[4] Bordeaux-style Red Blend Red Blend               
632 Levels: Žilavka Agiorgitiko Aglianico Aidani Airen Albana Albarín ... Zweigelt

Is there a way to remove these completely? They are causing issues with my training set: it still sees the dropped varieties as levels, but has no data for them.

M--
theJ

1 Answer

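I think this is what you are looking for. You are almost there: filtering the rows does not drop unused factor levels, so the factor has to be rebuilt afterwards.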

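wineVar <- setDT(wineNew)
# keep only the groups (varieties) with more than 5000 rows
wineVar <- wineVar[, .SD[.N > 5000], by = variety]
# rebuild the factor so that only the levels still present remain
wineVar[, variety := as.factor(as.character(variety))]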
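As an alternative sketch (not part of the code above), base R's droplevels() does the same job without the round trip through character. The toy data frame, its column values, and the threshold of 2 below are made up for illustration and stand in for wineNew and the 5000 cutoff.

```r
# Toy data standing in for wineNew (hypothetical values; threshold lowered to 2)
wine_toy <- data.frame(
  variety = factor(c("Merlot", "Merlot", "Merlot", "Syrah",
                     "Malbec", "Malbec", "Malbec")),
  points  = c(88, 90, 91, 85, 87, 89, 92)
)

# Count rows per variety and keep only the varieties above the threshold
counts   <- table(wine_toy$variety)
keep     <- names(counts)[counts > 2]
wine_sub <- wine_toy[wine_toy$variety %in% keep, ]

# Subsetting rows keeps the unused level "Syrah"
nlevels(wine_sub$variety)   # 3

# droplevels() removes factor levels that no longer occur in the data
wine_sub <- droplevels(wine_sub)
levels(wine_sub$variety)    # "Malbec" "Merlot"
```

With the data table from the question, the same idea would presumably be written as wineVar[, variety := droplevels(variety)], since droplevels() also works on a single factor column.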
JeanVuda