I am testing a new dataset, which is in CSV format. First, I trained a model using:

matrix <- create_matrix(train["Title"], language="english", weighting=tm::weightTfIdf)
container <- create_container(matrix,train$TagId,trainSize=1:x, testSize=(x+1):nrow(train),virgin=FALSE)

# train an SVM model (the variables are named maxent_*, but the algorithm is SVM)
maxent_model <- train_models(container,algorithms=c("SVM"))
maxent_results <- classify_models(container,maxent_model)

# test the model on test data
maxenttestData = train[(x+1):nrow(train),]
maxenttestData = data.frame(maxenttestData, maxent_results)
write.csv(maxenttestData, "MAXENT.csv", row.names = FALSE)

To test the model on the new dataset, I am using:

new = read_csv("new.csv")
new$Title = toupper(new$Title)
new$Title = gsub("[<].*[>]", "", as.character(new$Title))
new$Title = gsub("&amp", "", new$Title)
new$Title = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", new$Title)
new$Title = gsub("@\\w+", "", new$Title)
new$Title = gsub("[[:punct:]]", "", new$Title)
new$Title = gsub("[[:digit:]]", "", new$Title)
new$Title = gsub("http\\w+", "", new$Title)
new$Title = gsub("[ \t]{2,}", "", new$Title)
new$Title = gsub("^\\s+|\\s+$", "", new$Title)
#write.csv(new, "preprocess_new.csv", row.names = FALSE)
matrix <- create_matrix(new["Title"], language="english", weighting=tm::weightTfIdf)
container <- create_container(matrix, new$TagId, trainSize=NULL, testSize=1:nrow(new), virgin=FALSE)
maxent_results <- classify_models(container,maxent_model)
write.csv(maxent_results, "MAXENT_res.csv", row.names = FALSE)

But it fails with this error:

maxent_results <- classify_models(container,maxent_model)
Error in predict.svm(model, container@classification_matrix, prob = TRUE, : test data does not match model !

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Do your test/train data sets have the same columns? If they have factors, do the factors have the same levels? – MrFlick Apr 09 '19 at 13:52
  • The test/train data sets have the same columns, but in maxent_model the number of columns increases when the code is run to train the model – suman Apr 09 '19 at 13:55

1 Answer

Look at the first gsub and the result of the code below:

aaa <- "<html><title>X</title>all webpage content is between < and >  </html>"
aaa <- gsub("[<].*[>]", "", aaa)
aaa
[1] ""

Because `.*` is greedy, the pattern matches everything from the first < to the last >. After this operation, there is nothing left to classify if the text is a block of HTML code.

  • This is not the problem. I printed Train$Title after this statement, and it still has titles to classify – suman Apr 10 '19 at 09:15
  • You have not given us any reproducible example, so I was guessing. I supposed you had not checked the result after all gsubs, because you transform all letters to uppercase and then use lowercase letters in "http", "via". – Grzegorz Sionkowski Apr 10 '19 at 09:29
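
The two preprocessing points raised in this thread (the greedy tag pattern, and lowercase patterns applied after toupper) can be checked in isolation with base R. This is only a minimal sketch: the sample title is made up, and the per-tag pattern `<[^>]*>` and the `ignore.case = TRUE` argument are suggested alternatives, not something taken from the question. It does not by itself address the `predict.svm` error.

```r
# Made-up sample title with markup around the real text (hypothetical data).
title <- "<b>Fix login bug</b> http://t.co/abc"

# Greedy: ".*" runs from the first "<" to the last ">", taking the text with it.
greedy <- gsub("[<].*[>]", "", title)

# Per tag: "[^>]*" cannot cross a ">", so each tag is removed on its own.
per_tag <- gsub("<[^>]*>", "", title)

# Same order as in the question: uppercase first, then strip punctuation.
upper <- toupper(per_tag)
no_punct <- gsub("[[:punct:]]", "", upper)

# A lowercase pattern no longer matches the uppercased text ...
case_sensitive <- gsub("http\\w+", "", no_punct)

# ... unless the match is made case-insensitive.
case_insensitive <- gsub("http\\w+", "", no_punct, ignore.case = TRUE)

print(greedy)
print(per_tag)
print(case_sensitive)
print(case_insensitive)
```

Printing each intermediate value like this after every gsub is a quick way to confirm that the titles still contain the words you expect before building the document-term matrix.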