
Below is pseudo data representing my training data:

data

I am fitting a random forest for binary classification in R.

rf <- randomForest(Default ~ ., data = traindata, ntree = 300, mtry = 18, importance = TRUE)

I want the model to work at the level of the individual personid.

For example, for personid 112 I want a single prediction of either 1 or 0.

Right now my model takes in the entire data set and gives a different prediction for each month. I want predictions per personid.

That is, a single prediction for a single id, not one per month.

I have 265 distinct personid values in total.

Will using group_by() from the dplyr package help me?

As the number of personid values is large, how will I also predict on new data?

*Condition: I cannot average the data to flatten it out, as this is financial data.

danishxr

1 Answer


You can use dplyr and tidyr to reshape the data to one row per personid; see the example below. This gives you many extra variables (one per balance and month) to use in the rf model, which is probably what you need.

library(dplyr)
library(tidyr)


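spread_data <- df %>%
  # stack the Balance columns into key/value pairs (long format)
  gather(Balances, value, starts_with("Balance")) %>%
  # build one combined key per balance and month, e.g. "Balance1_Month1"
  unite(Bal_month, Balances, Month) %>%
  # give each combined key its own column: one row per personid
  spread(Bal_month, value)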

personid Default Balance1_Month1 Balance1_Month2 Balance1_Month3 Balance1_Month4 Balance2_Month1 Balance2_Month2 Balance2_Month3
1      112       1          123465        45343456              NA              NA          234567         5498731              NA
2      113       0          534564         9616613            6164              NA           64613            3496         3189479
3      114       1             621         1615494           32165              NA            3168              97          165197
4      115       0       123164964           97946           21679          791639           47643            1679             179
  Balance2_Month4
1              NA
2              NA
3              NA
4          167976
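
As a rough sketch of how such a wide table could then be used for one prediction per personid: the code below builds a hypothetical toy table in the same layout (the values, the number of rows and the name new_wide are made up for illustration), fills the NAs with na.roughfix() from the randomForest package, fits the model and predicts on new data reshaped the same way. Median imputation is an assumption here, not something the question asks for; whether it is acceptable for financial data is a judgment call.

```r
library(randomForest)

set.seed(42)

# Hypothetical wide table: one row per personid, toy values only
n <- 40
wide <- data.frame(
  personid        = seq_len(n),
  Default         = factor(rep(c(0, 1), length.out = n)),
  Balance1_Month1 = runif(n, 0, 1e6),
  Balance1_Month2 = runif(n, 0, 1e6),
  Balance2_Month1 = runif(n, 0, 1e6),
  Balance2_Month2 = runif(n, 0, 1e6)
)
wide$Balance1_Month2[c(3, 7)] <- NA  # months with no record become NA

# personid is an identifier, not a predictor, so it is left out
predictors <- setdiff(names(wide), c("personid", "Default"))

# randomForest() stops on NA predictors by default;
# na.roughfix() replaces NAs in numeric columns with the column median
train_x <- na.roughfix(wide[, predictors])
rf <- randomForest(x = train_x, y = wide$Default, ntree = 300)

# New data must be reshaped to the same columns before predicting
new_wide <- wide[1:5, ]
new_x <- na.roughfix(new_wide[, predictors])
pred <- predict(rf, newdata = new_x)

# one predicted class (0 or 1) per personid
data.frame(personid = new_wide$personid, prediction = pred)
```

Because every row is one personid, predict() returns exactly one class per person rather than one per month.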

More reading on casting: "How to spread or cast multiple values in R" or "Can the value.var in dcast be a list or have multiple value variables?"
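
In newer versions of tidyr (1.0.0 and later), gather() and spread() are superseded, and the same reshaping can be sketched with pivot_wider(). The small data frame df_long below is a stand-in that reuses a few rows from the example data; the name is only for illustration.

```r
library(tidyr)

# Stand-in for the long-format data (first two persons of the example data)
df_long <- data.frame(
  personid = c(112L, 112L, 113L, 113L, 113L),
  Month    = c("Month1", "Month2", "Month1", "Month2", "Month3"),
  Balance1 = c(123465, 45343456, 534564, 9616613, 6164),
  Balance2 = c(234567, 5498731, 64613, 3496, 3189479),
  Default  = c(1L, 1L, 0L, 0L, 0L)
)

# One row per personid; columns are named <balance>_<month>,
# and months without a record are filled with NA
wide <- pivot_wider(
  df_long,
  id_cols     = c(personid, Default),
  names_from  = Month,
  values_from = c(Balance1, Balance2)
)

wide
```

This avoids the intermediate gather()/unite() step because pivot_wider() accepts several value columns at once.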

Example data used:

df <-
  structure(
    list(
      personid = c(
        112L,
        112L,
        113L,
        113L,
        113L,
        114L,
        114L,
        114L,
        115L,
        115L,
        115L,
        115L
      ),
      Month = c(
        "Month1",
        "Month2",
        "Month1",
        "Month2",
        "Month3",
        "Month1",
        "Month2",
        "Month3",
        "Month1",
        "Month2",
        "Month3",
        "Month4"
      ),
      Balance1 = c(
        123465,
        45343456,
        534564,
        9616613,
        6164,
        621,
        1615494,
        32165,
        123164964,
        97946,
        21679,
        791639
      ),
      Balance2 = c(
        234567,
        5498731,
        64613,
        3496,
        3189479,
        3168,
        97,
        165197,
        47643,
        1679,
        179,
        167976
      ),
      Default = c(1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)
    ),
    .Names = c("personid", "Month", "Balance1", "Balance2", "Default"),
    class = "data.frame",
    row.names = c(NA,-12L)
  )
phiver