I have a data set in which one column has 142 unique values. As part of building a predictive model, I want to create dummy variables for that column. But instead of creating 142 dummy variables, I first want to group the values that behave similarly with respect to the response variable. The code I used is below:

round(tapply(train_data$Price,train_data$Suburb,mean),0)

This gives me an array of 142 elements, which is time-consuming to go through manually to find similar values. A snippet of my output is pasted below:

round(tapply(train_data$Price,train_data$Suburb,mean),0)
        Abbotsford         Aberfeldie       Airport West 
           1057934            1235150             707542 
       Albert Park             Albion         Alphington 
           1919014             547711            1188880 
            Altona       Altona North           Armadale 
            757866             728127            1542430 
        Ascot Vale          Ashburton            Ashwood 
            968702            1595275            1049184 
  Avondale Heights          Balaclava             Balwyn 
            792321             675133            1912896 
      Balwyn North          Bellfield          Bentleigh 
           1769984             798778            1282869 
    Bentleigh East           Box Hill          Braybrook 
           1038886            1138650             646845 
          Brighton      Brighton East           Brooklyn 
           1864928            1607299             542182 
         Brunswick     Brunswick East     Brunswick West 
            952350             874927             744986 
           Bulleen            Burnley            Burwood 
           1142944            1150902            1167023 
        Camberwell      Campbellfield         Canterbury 
           1761263             447600            2284188 
           Carlton      Carlton North           Carnegie 
           1062721            1436615             915587 
         Caulfield     Caulfield East    Caulfield North 
            981417            1099000            1055575 
   Caulfield South          Chadstone       Clifton Hill 
           1119571            1007909            1049742 
            Coburg       Coburg North        Collingwood 
            851215             770902             858415 
          Cremorne          Docklands          Doncaster 
            943731             937500            1210059 
         Eaglemont     East Melbourne        Elsternwick 

How can I write code that groups the values by a condition on the mean, such as means that fall between 600000-699999, 700000-799999, and so on?

Biswa
  • Without a complete picture of the data, this is difficult to answer. You should add a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250). You should also look at the `?cut` R function, as it could apply to your problem. – Marcelo Oct 25 '17 at 03:31
  • Put simply: from the output I pasted above, I want only those column values where the mean falls in a particular range, say 600000-700000, and so on. I hope that explains it properly. – Biswa Oct 25 '17 at 03:38

1 Answer

I found code that served my purpose completely:

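# Flag each suburb whose mean price lies in [600000, 700000), then keep only the flagged suburbs.
# The lower bound is inclusive so that the band matches the 600000-699999 range in the question.
subset(aggregate(Price ~ Suburb,
                 train_data,
                 function(x) as.integer(mean(x) >= 600000 & mean(x) < 700000)),
       Price == 1)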
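
The `subset` call above handles one band at a time. As the comment on the question suggests, `cut` could assign every suburb to a band in a single pass. The following is only a minimal sketch, assuming 100,000-wide bands; the toy `train_data` and the names `suburb_means`, `band_width`, `breaks`, `PriceBand`, and `suburbs_by_band` are illustrative and not from the original post.

```r
# Toy stand-in for train_data (hypothetical values, not the real data set)
train_data <- data.frame(
  Suburb = c("A", "A", "B", "B", "C", "C", "D"),
  Price  = c(610000, 650000, 720000, 780000, 640000, 690000, 1250000)
)

# Mean price per suburb
suburb_means <- aggregate(Price ~ Suburb, data = train_data, FUN = mean)

# Band edges every 100,000, wide enough to cover all suburb means (assumed width)
band_width <- 100000
breaks <- seq(floor(min(suburb_means$Price) / band_width) * band_width,
              (floor(max(suburb_means$Price) / band_width) + 1) * band_width,
              by = band_width)

# right = FALSE gives left-closed bands, so 600000 falls in the 600000-699999 band;
# dig.lab = 10 keeps the band labels out of scientific notation
suburb_means$PriceBand <- cut(suburb_means$Price, breaks = breaks,
                              right = FALSE, dig.lab = 10)

# Suburbs listed per band, skipping bands that contain no suburb
suburbs_by_band <- split(as.character(suburb_means$Suburb),
                         suburb_means$PriceBand, drop = TRUE)

# Attach the band to every row so it can stand in for Suburb when building dummies
train_data$PriceBand <- suburb_means$PriceBand[match(train_data$Suburb,
                                                     suburb_means$Suburb)]
```

Printing `suburbs_by_band` lists the suburbs in each band, and `train_data$PriceBand` is a factor with far fewer levels than `Suburb`, so it could be dummy-encoded in its place.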
Biswa