I use SPSS Modeler v18.2.1 with R v3.5.1 (or v3.3.3) and Essentials for R 18.2.1.

I'm trying to build "Extension Transform (R syntax)" nodes to handle problems that are difficult to solve in SPSS (and, in the future, turn them into Extension Bundles). I want these nodes to add multiple columns, create new data, and so on, AND pass a data.frame to the next node. However, the SPSS nodes read the data.frame incorrectly: the output of a downstream Table node differs from the console output of print(modelerData).
How can I do this? (Or is it a bug?)

Any help would be greatly appreciated. Below is a simple reproducible example:

[Preparing the R environment and data (please run this in plain R)]

# if not installed 
install.packages("randomForest")

set.seed(1)  # to reproduce
write.csv(iris[sort(sample(1:150, 100)), ], "iris_train_seed1.csv", row.names = FALSE)

[My node flow]
(screenshot of the node flow; image not available)

[R code of Extension Transform]

### library ###
library(randomForest)

# make_model
set.seed(1)
modelerModel <- randomForest(formula = Species ~ . ,
                             data = modelerData,
                             ntree = 100)

#### predict
pred_forest <- data.frame(pred = predict(modelerModel, 
                                         newdata = modelerData))
prob_forest <- as.data.frame(predict(modelerModel, 
                                     newdata = modelerData,
                                     type = "prob"))


# overwriting modelerData
modelerData <- cbind(modelerData, pred_forest, prob_forest)

# function definition to make modelerDataModel 
getMetaData <- function (data) {
  if (dim(data)[1]<=0) {
    print("Warning : modelerData has no line, all fieldStorage fields set to strings")
    getStorage <- function(x){return("string")}
  } else {
    getStorage <- function(x) {
      res <- NULL
      #if x is a factor, typeof will return an integer so we treat the case on the side
      if(is.factor(x)) {
        res <- "string"
      } else {
        res <- switch(typeof(unlist(x)),
                      integer = "integer",
                      #  integer = "real",      
                      double = "real",
                      character = "string",
                      "string")
      }
      return (res)
    }
  }
  col = vector("list", dim(data)[2])
  for (i in 1:dim(data)[2]) {
    col[[i]] <- c(fieldName=names(data[i]),
                  fieldLabel="",
                  fieldStorage=getStorage(data[[i]]), 
                  fieldMeasure="",
                  fieldFormat="",
                  fieldRole="")
  }
  mdm<-do.call(cbind,col)
  mdm<-data.frame(mdm)
  return(mdm)
}

# overwriting modelerDataModel
modelerDataModel <- getMetaData(modelerData)

# to check
print(dim(modelerData))
print(head(modelerData))
print(dim(modelerDataModel))
print(modelerDataModel)
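
Since the symptom is a column-count mismatch (9 columns in R, 11 in the Table node), one thing that may be worth ruling out first is a mismatch between modelerData and modelerDataModel themselves. The following is only a minimal sketch in plain R: check_data_model is a hypothetical helper (not part of the Modeler or Essentials for R API), and demo_data / demo_model are toy stand-ins, with the data model reduced to two rows for brevity.

```r
# Hypothetical helper (not a Modeler API): checks that the data model
# describes exactly the columns that are present in the data frame.
check_data_model <- function(data, model) {
  # Each column of the one-row slice holds a single field name
  # (factor or character, depending on the R version).
  field_names <- unname(vapply(model["fieldName", ], as.character, character(1)))
  list(
    same_column_count = ncol(data) == ncol(model),
    same_field_names  = identical(field_names, names(data)),
    # Columns that are themselves lists; these would be unexpected here.
    list_columns      = names(data)[vapply(data, is.list, logical(1))]
  )
}

# Toy stand-ins (assumptions): inside the node, use modelerData and
# modelerDataModel instead.
demo_data  <- cbind(iris[1:5, ], pred = iris$Species[1:5])
demo_model <- data.frame(rbind(
  fieldName    = names(demo_data),
  fieldStorage = ifelse(vapply(demo_data, is.numeric, logical(1)), "real", "string")
), stringsAsFactors = FALSE)

res <- check_data_model(demo_data, demo_model)
print(res)
```

If all checks pass inside the node as well, the two objects agree from R's side, which would suggest that the extra columns appear later, when Modeler reads the data back.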

[Console output of the "to check" part (print(head(modelerData)) shows what I want the Table node to output)]

# print(dim(modelerData))
[1] 100   9

# print(head(modelerData))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   pred setosa
1          4.9         3.0          1.4         0.2  setosa setosa      1
2          4.7         3.2          1.3         0.2  setosa setosa      1
3          5.0         3.6          1.4         0.2  setosa setosa      1
4          5.4         3.9          1.7         0.4  setosa setosa      1
5          4.6         3.4          1.4         0.3  setosa setosa      1
6          5.0         3.4          1.5         0.2  setosa setosa      1
  versicolor virginica
1          0         0
2          0         0
3          0         0
4          0         0
5          0         0
6          0         0

# print(dim(modelerDataModel))
[1] 6 9

# print(modelerDataModel)
                       X1          X2           X3          X4      X5     X6
fieldName    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   pred
fieldLabel                                                                   
fieldStorage         real        real         real        real  string string
fieldMeasure                                                                 
fieldFormat                                                                  
fieldRole                                                                    
                 X7         X8        X9
fieldName    setosa versicolor virginica
fieldLabel                              
fieldStorage   real       real      real
fieldMeasure                            
fieldFormat                             
fieldRole  

[Output of the Table node (why are there 11 columns?)]
(screenshot of the Table node output; image not available)

eli-k
cuttlefish44

2 Answers


This might be because your Species and pred columns are of type factor rather than character, and, looking at the SPSS node docs, there is no storage type for factor. Since there are two factor columns, the two additional columns in the Table node output could represent the factor levels of those columns, produced as SPSS tries to coerce them to string. You need them as factors for the predict function at the start of your script, but right before you pass the data on to the Table node, try:

modelerData[] <- lapply(modelerData, function(x) if (is.factor(x)) as.character(x) else {x})

I don't have SPSS to be able to test this theory, but hopefully that solves your problem or gets you a little closer.
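
To make clear what that one-liner does in plain R, here is a small sketch on a toy data frame. The object demo is a hypothetical stand-in for modelerData, and this only illustrates the conversion itself; it says nothing about how Modeler reads the result (per the comments below, the conversion did not resolve the problem).

```r
# Toy stand-in for modelerData (assumption): two factor columns, one numeric.
demo <- data.frame(Species = factor(c("setosa", "virginica")),
                   Petal.Width = c(0.2, 1.8))
demo$pred <- demo$Species

# Record the first class of each column before and after the conversion.
before <- vapply(demo, function(x) class(x)[1], character(1))

# demo[] <- ... keeps the data.frame structure and replaces only the columns.
demo[] <- lapply(demo, function(x) if (is.factor(x)) as.character(x) else x)

after <- vapply(demo, function(x) class(x)[1], character(1))
print(rbind(before, after))
```

The number of columns is unchanged by the conversion; only the factor columns become character vectors.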

Anna Nevison
  • Thank you for your response. As far as I can see, SPSS handles `factor` appropriately. Unfortunately, your idea doesn't work. – cuttlefish44 Jun 22 '20 at 04:34
  • @cuttlefish44 the only main difference I can see between the cbind of the first (non-working) method and the second (working) method is that in the second you are binding two data frames and a matrix, while in the first you are binding three data frames. You should try converting `prob_forest` to a matrix before the cbind and see if that changes anything – Anna Nevison Jun 22 '20 at 15:31
  • I agree with you. But it doesn't work... To be honest, I gave up and decided to take another approach: passing the data via a csv file instead of handing it to SPSS directly. I appreciate your kindness! – cuttlefish44 Jun 24 '20 at 04:47
  • @cuttlefish44 no worries at all. This really stumped me, but I am glad you found another method. If it's a bug in Rlang though, you should file it as an issue and link this post! – Anna Nevison Jun 24 '20 at 14:10

I found a workaround that solves my simple example, although it is hard to understand why it works. From R's perspective, this looks like a bug. (However, the workaround doesn't work in other situations. Does anyone know how to avoid this bug?)

questions_modelerData <- cbind(modelerData, pred_forest, prob_forest)

modelerData <- cbind(modelerData, pred_forest, 
                     setosa = prob_forest[,1], 
                     versicolor = prob_forest[,2],   
                     virginica = prob_forest[,3])

identical(questions_modelerData, modelerData)
# [1] TRUE
# but this modelerData works, unlike the one in the question.

Damn it.
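
For anyone who wants to try the same workaround without typing each probability column by hand, the call can be built programmatically. This is only a sketch in plain R with toy stand-ins (base_data, pred_forest and prob_forest below are made-up assumptions, not real model output), and it has not been tested inside Modeler, so it may hit the same limitation as the manual version.

```r
# Toy stand-ins (assumptions): in the node, base_data would be modelerData and
# prob_forest would come from predict(modelerModel, type = "prob").
base_data   <- iris[1:3, ]
pred_forest <- data.frame(pred = iris$Species[1:3])
prob_forest <- data.frame(setosa     = c(1, 1, 1),
                          versicolor = c(0, 0, 0),
                          virginica  = c(0, 0, 0))

# Pass every column of prob_forest to cbind() as its own named vector,
# mirroring the manual setosa = prob_forest[, 1], ... call.
combined <- do.call(cbind, c(list(base_data, pred_forest), as.list(prob_forest)))

# The manual version of the workaround, for comparison.
manual <- cbind(base_data, pred_forest,
                setosa     = prob_forest[, 1],
                versicolor = prob_forest[, 2],
                virginica  = prob_forest[, 3])

print(identical(combined, manual))
print(names(combined))
```

Using as.list() on the probability data frame keeps the column names, so the approach does not depend on the number of classes.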

cuttlefish44