
I have a data frame of observed and modelled data on which I am trying to calculate the RMSLE metric. The data is stored in one text file, which I can read in, and has a format similar to this:

Group Time Observed Predicted
A 2 190 312
A 3 174 345
A 6 150 290
A 12 85 217
B 4 300 725
B 9 113 426
B 13 120 393
B 23 97 263
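
For anyone wanting to reproduce this, the dummy data above can be built directly in R. This is only a sketch: `read.table(text = ...)` stands in for reading the real file, and the object name `ModFull` is the one used further down.

```r
# Dummy data from the table above, read from an inline string instead of a file
ModFull <- read.table(text = "Group Time Observed Predicted
A 2 190 312
A 3 174 345
A 6 150 290
A 12 85 217
B 4 300 725
B 9 113 426
B 13 120 393
B 23 97 263", header = TRUE, stringsAsFactors = FALSE)

# Check the column types and the number of rows per group
str(ModFull)
table(ModFull$Group)
```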

In reality I have many more data points and many more groups to calculate the RMSLE for; this is just some dummy data. I have been using MLmetrics and dplyr. I have spent too long trying to get something to work: a for loop with a custom function, the split function, and saving each group's data to a file to read back in and process individually. Some description and code is below. I have been able to use RMSLE successfully for one group (i.e. one group in its own file) as follows:

# Set up two vectors
y_pred <- Model_Comparison$Predicted
print(y_pred)
y_true <- Model_Comparison$Observed
print(y_true)

# Calculate RMSLE
RMSLE(y_pred = y_pred, y_true = y_true)

I have tried using a for loop, and have tried putting the RMSLE call in different places, including in a user-defined function, but I run into trouble defining y_pred and y_true. I believe that by predefining the y's, each one is passed into the loop as a single vector covering all groups, resulting in one value only.

ModFull <- read.table("C:/...", header = TRUE)

spp.l <- split(ModFull, ModFull$Group)

# For loop to look at the first few lines of each spp in the list
for (Group in spp.l) {
  print(head(Group))
}

# Define y's

y_pred <- ModFull$Predicted
y_true <- ModFull$Observed

# Now get the statistic for each Group using a for loop
res <- list()
# RMSLE from the y's defined above
RMSLE <- sqrt(mean((log(1 + y_true) - log(1 + y_pred))^2))
for (n in names(spp.l)) {
  dat <- spp.l[[n]]
  res[[n]] <- data.frame(Group = n,
                         RMSLE,
                         n.samples = nrow(dat))
}

print(res)

The above code results in the same RMSLE value for all Groups, but the structure is fine.

Group RMSLE n.samples
A 0.929 4
B 0.929 4

I have also tried a split approach, saving each group to its own ".Rdata" file following the methods in "R - split data frame and save to different files", but the resulting files appear to be corrupted:

Error in load(name, envir = .GlobalEnv) : bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning messages:
1: In readChar(con, 5L, useBytes = TRUE) : truncating string with embedded nuls
2: file ‘Group1.Rdata’ has magic number 'X'
  Use of save versions prior to 2 is deprecated
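
A note on this error: `load()` only reads files written by `save()`, so a file written some other way (for example with `write.table()`) and merely named ".Rdata" typically fails with this "bad restore file magic number" message. The code that wrote the files is not shown here, so that cause is a guess. A minimal sketch of one way to write and re-read one file per group, using `saveRDS()` and `readRDS()` as a matched pair; the output folder and the file names are hypothetical:

```r
# Dummy data from the question (stand-in for the real file)
ModFull <- data.frame(
  Group     = rep(c("A", "B"), each = 4),
  Time      = c(2, 3, 6, 12, 4, 9, 13, 23),
  Observed  = c(190, 174, 150, 85, 300, 113, 120, 97),
  Predicted = c(312, 345, 290, 217, 725, 426, 393, 263)
)
spp.l <- split(ModFull, ModFull$Group)

# Hypothetical output folder; tempdir() keeps the sketch self-contained
out_dir <- tempdir()

# Write each group's data frame to its own file as a single R object
for (n in names(spp.l)) {
  saveRDS(spp.l[[n]], file = file.path(out_dir, paste0("Group_", n, ".rds")))
}

# Read one group back with the matching reader
group_A <- readRDS(file.path(out_dir, "Group_A.rds"))
print(group_A)
```

Writing the groups to disk is not needed just to compute a per-group statistic, since the list returned by `split()` can be used directly.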

Lastly, I have tried to follow some work on RMSE values (How to calculate RMSE for groups of data from csv) but run into the same problem: defining the y_pred and y_true values.

Any and all help would be appreciated.

  • It would be easier to help if you create a small reproducible example along with expected output. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Jan 25 '21 at 03:11
  • @Ronak, hopefully my edits make it a little clearer. – Random Lee Jan 25 '21 at 10:51

1 Answer

You are using the unsplit, full data.frame to assign your y_pred and y_true, so every group gets the RMSLE of the whole data set. You should assign y_pred and y_true inside the for loop, using the split data.frame for each group.
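
To see the difference, here is a small check on the dummy data from the question. It is a sketch: `rmsle` is a local helper written from the formula in the question, not the MLmetrics function, and `dat_A` is an illustrative name.

```r
# Dummy data from the question (stand-in for the real file)
ModFull <- data.frame(
  Group     = rep(c("A", "B"), each = 4),
  Observed  = c(190, 174, 150, 85, 300, 113, 120, 97),
  Predicted = c(312, 345, 290, 217, 725, 426, 393, 263)
)

# Local helper using the same formula as in the question
rmsle <- function(y_pred, y_true) {
  sqrt(mean((log(1 + y_true) - log(1 + y_pred))^2))
}

# Vectors taken from the full data frame contain every group's rows
pooled <- rmsle(y_pred = ModFull$Predicted, y_true = ModFull$Observed)
round(pooled, 3)  # 0.929, the value the question reports for both groups

# Vectors taken from one group's subset give that group's own value
dat_A <- ModFull[ModFull$Group == "A", ]
group_A <- rmsle(y_pred = dat_A$Predicted, y_true = dat_A$Observed)
round(group_A, 3)
```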

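# Now get the statistic for each Group using a for loop
res <- list()
for (n in names(spp.l)) {
   # Subset for the current group only
   dat <- spp.l[[n]]
   y_true <- dat$Observed
   y_pred <- dat$Predicted
   # Store the result under a name that does not mask the RMSLE() function
   rmsle_val <- RMSLE(y_pred = y_pred, y_true = y_true)
   res[[n]] <- data.frame(
      Group = n,
      RMSLE = rmsle_val,
      n.samples = nrow(dat)
   )
}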
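
If a single table is more convenient than a list, the same idea can be written with `lapply()` and the rows stacked with `do.call(rbind, ...)`. This is a sketch under a few assumptions: the data frame is the dummy data from the question, and `rmsle` is a local helper written from the formula in the question so the example runs without extra packages. `MLmetrics::RMSLE(y_pred, y_true)` can be called in its place.

```r
# Dummy data from the question (stand-in for the real file)
ModFull <- data.frame(
  Group     = rep(c("A", "B"), each = 4),
  Observed  = c(190, 174, 150, 85, 300, 113, 120, 97),
  Predicted = c(312, 345, 290, 217, 725, 426, 393, 263)
)

# Local helper using the same formula as in the question
rmsle <- function(y_pred, y_true) {
  sqrt(mean((log(1 + y_true) - log(1 + y_pred))^2))
}

spp.l <- split(ModFull, ModFull$Group)

# One row per group; each call only sees that group's rows
res <- lapply(names(spp.l), function(n) {
  dat <- spp.l[[n]]
  data.frame(
    Group     = n,
    RMSLE     = rmsle(y_pred = dat$Predicted, y_true = dat$Observed),
    n.samples = nrow(dat)
  )
})

# Stack the per-group rows into a single data frame
res_df <- do.call(rbind, res)
print(res_df)
```

The values now differ between groups, because the vectors passed to the helper come from `dat` rather than from `ModFull`.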
Wawv
  • That worked great Wawv, thanks for your time on this. I had actually tried to put the y's inside before, but hadn't changed them to dat$ (they were ModFull$, and resulted in the same value for all groups)--that was the key thing to do. – Random Lee Jan 25 '21 at 12:33