I have a dataset of forest polygons and I am trying to compare the Field classifications with the Map classifications using a confusion matrix. The only function I could find that handles more than 2 classes and can compare text values is 'confusion' in the package 'mda', so that is what I have been running.

The example provided with the package is...

data(iris)
irisfit <- fda(Species ~ ., data = iris)
confusion(predict(irisfit, iris), iris$Species)
                 Setosa       Versicolor       Virginica
Setosa            50              0               0
Versicolor         0             48               1
Virginica          0              2              49

attr(, "error"):
[1] 0.02

I run mine as simply

data(Habitat)
confusion(Habitat$Field,Habitat$Map)

This gives me a confusion matrix similar to (but not nearly as neat as) the one in the package example. At this point I get lost, because my output carries two attributes instead of one:

attr(,"error")
[1] 0.3448276
attr(,"mismatch")
[1] 0.889313

I understand "error". For "mismatch", however, I cannot find any explanation online or in the documentation that comes with the package. I doubt such a high "mismatch" value is good, but I have no idea how to interpret it. This is probably a fairly specific question that only someone who has worked with this package could answer, but if anyone knows, or has a hint on how to find out, I would greatly appreciate it.

Thanks, Ayden

EDIT: Added a clip of my dataset (dput output) showing what may be causing the mismatch; I suspect it means consistent misclassifications. The clip covers the most consistent misclassification.

structure(list(Field = structure(c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 8L), .Label = c("Black Spruce ", "Clearcut ", 
"Deciduous ", "Jack Pine ", "Lowland Conifer ", "Marshwillow ", 
"Mixed Conifer ", "Open Muskeg ", "Rock ", "Treed Muskeg ", "Upland Conifer ", 
"Young Conifer", "Young Deciduous"), class = "factor"), Map = structure(c(7L, 
7L, 7L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 13L, 13L, 13L, 6L), .Label = c("Black     Spruce", "Clearcut", "Deciduous", "Jack Pine", "Lowland Conifer", "Marshwillow", 
"Mixed Conifer", "Open Muskeg", "Rock", "Treed Muskeg", "Upland Conifer", 
"Young Conifer", "Young Deciduous"), class = "factor")), .Names = c("Field", 
"Map"), row.names = 143:156, class = "data.frame")
HeidelbergSlide

1 Answer

It seems to mean that the variables don't share a common set of values. If one is predicting the other, it is predicting values that are not present (or the other way round). Mismatch seems to be the proportion of cases assigned a value not present in the levels of the other variable.

In the iris example you posted, we can produce the same kind of output by introducing a new value into one of the two variables passed to confusion. Since they're factors, we need to create a new factor level first.

library(mda)  # provides fda() and confusion()
data(iris)
irisfit <- fda(Species ~ ., data = iris)
iris$Predict <- predict(irisfit, iris)
# add a new level 'monsterosa' that iris$Species does not have
iris$Predict <- factor(iris$Predict,
                       levels = c("setosa", "versicolor", "virginica", "monsterosa"))
iris$Predict[1] <- "monsterosa"  # assign it to one of the observations

Now we can re-run the confusion function and get a mismatch:

confusion(iris$Predict, iris$Species)
            true
predicted    setosa versicolor virginica
  setosa         49          0         0
  versicolor      0         48         1
  virginica       0          2        49
  monsterosa      1          0         0
attr(,"error")
[1] 0.02013423
attr(,"mismatch")
[1] 0.006666667
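
The two numbers above can be reproduced by hand. This is a minimal sketch in base R, not code taken from the package: it assumes that mismatch is the share of observations whose label (in either variable) is not among the labels common to both, and that error is computed only on the remaining observations. The label vectors are rebuilt from the counts in the printed table, so mda is not needed to run it.

```r
# Rebuild predicted/true label vectors from the cell counts printed above
pred_cell <- c("setosa", "versicolor", "versicolor", "virginica", "virginica", "monsterosa")
true_cell <- c("setosa", "versicolor", "virginica", "versicolor", "virginica", "setosa")
n_cell    <- c(49, 48, 1, 2, 49, 1)

predicted <- rep(pred_cell, times = n_cell)
true      <- rep(true_cell, times = n_cell)

tab <- table(predicted = predicted, true = true)

# Labels that occur in both variables (assumption: only these count as "matched")
shared <- intersect(rownames(tab), colnames(tab))
core   <- tab[shared, shared, drop = FALSE]

# Share of observations that fall outside the shared-label block
mismatch <- (sum(tab) - sum(core)) / sum(tab)

# Misclassification rate within the shared-label block
error <- 1 - sum(diag(core)) / sum(core)

print(c(error = error, mismatch = mismatch))
```

Under those assumptions the single 'monsterosa' observation gives a mismatch of 1/150, and the error is 3 wrong out of the 149 remaining observations, which agrees with the output shown.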

And if we refactor the other variable to include all levels present in both variables, the mismatch goes away:

iris$Species=factor(iris$Species,levels= c("setosa", "versicolor",
      "virginica","monsterosa"))
confusion(iris$Predict, iris$Species)
            true
predicted    setosa versicolor virginica monsterosa
  setosa         49          0         0          0
  versicolor      0         48         1          0
  virginica       0          2        49          0
  monsterosa      1          0         0          0
attr(,"error")
[1] 0.02666667

I would compare as.character(unique(Habitat$Field)) and as.character(unique(Habitat$Map)) to track it down. The as.character is not needed, but makes it easy to read.
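
A quicker way to spot the offending labels is to ask for the levels that appear in only one of the two factors. The sketch below uses small hypothetical stand-ins for Habitat$Field and Habitat$Map (the values are made up for illustration); setdiff and encodeString are base R.

```r
# Hypothetical stand-ins for Habitat$Field and Habitat$Map
field <- factor(c("Rock ", "Clearcut", "Jack Pine"))
map   <- factor(c("Rock", "Clearcut", "Jack  Pine"))

# Labels present in one factor but not the other
only_field <- setdiff(levels(field), levels(map))
only_map   <- setdiff(levels(map), levels(field))

# Wrapping each label in quotes makes stray spaces visible
print(encodeString(only_field, quote = '"'))
print(encodeString(only_map, quote = '"'))
```

If both results are empty, the two variables share the same set of labels and the problem lies elsewhere.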

Now that you've added data, I see the issue seems to be that some labels have trailing spaces at the end and others have repeated spaces between words.

# see problem
as.character(levels(Habitat$Field))
as.character(levels(Habitat$Map))

# fix problem

# unfactor them for now so we can replace spaces
Habitat$Field<-as.character(Habitat$Field)
Habitat$Map<-as.character(Habitat$Map)

# replace unwanted spaces
Habitat$Field <- gsub("[[:space:]]*$","",Habitat$Field) #gets ending spaces
Habitat$Map <- gsub("[[:space:]]*$","",Habitat$Map) #gets ending spaces
Habitat$Map <- gsub("[[:space:]]{2,}"," ",Habitat$Map) # gets double spaces
Habitat$Field <- gsub("[[:space:]]{2,}"," ",Habitat$Field) # gets double spaces

# factor them again
Habitat$Field <-factor(Habitat$Field)
Habitat$Map<-factor(Habitat$Map)
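
The same cleanup can also be done on the factor levels directly, without converting to character and back. This is a sketch on a small hypothetical data frame standing in for Habitat; clean_labels is a helper name made up here, and trimws assumes R 3.2.0 or later.

```r
# Hypothetical stand-in for the real Habitat data
Habitat <- data.frame(
  Field = factor(c("Mixed Conifer ", "Mixed Conifer ", "Open Muskeg ")),
  Map   = factor(c("Mixed Conifer", "Upland  Conifer", "Marshwillow"))
)

# Hypothetical helper: collapse runs of whitespace, then trim both ends
clean_labels <- function(f) {
  levels(f) <- trimws(gsub("[[:space:]]+", " ", levels(f)))
  f
}

Habitat$Field <- clean_labels(Habitat$Field)
Habitat$Map   <- clean_labels(Habitat$Map)

print(levels(Habitat$Field))
print(levels(Habitat$Map))
```

Working on the levels touches each distinct label once rather than every row, which may matter on a large dataset.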
MattBagg
  • Hey, thanks so much for the response. It looks like that could be it from the example data, but when I run it through my own data I come up with the same examples and number of unique possibilities. I re-ran the example data through, adding more and more incorrectly classified species. I think it may be the number of consistent misclassifications. When I made 45 setosa = monsterosa the error was still small but the mismatch was huge. When I spread the misclassifications around the error went up and the mismatch went down. However, when I attempted to do that on my own data (collapsing classes – HeidelbergSlide Nov 12 '12 at 15:42
  • the mismatch value and error stayed constant throughout all merges (eventually collapsed 5 classes into 1, redoing the matrix on each collapse). So I dunno, but thanks very much for the help. – HeidelbergSlide Nov 12 '12 at 15:43
  • Can you dput(Habitat[somerows,c("Field","Map")]) where somerows is defined in a way that replicates the mismatch and paste the result into your question? That will make it much easier to give a better answer. – MattBagg Nov 12 '12 at 16:05
  • There we go, I think that was what you were asking for? – HeidelbergSlide Nov 12 '12 at 16:42
  • No, sadly. :-) The way you did it does not provide any information about the structure of the data (e.g., are they factors, if so what are the levels?). If you wrap an R object in the dput function it returns something that looks ugly, but which you can paste into R and exactly reproduce your R object. If you type somerows=seq(1,100) and then type the code above it will return something containing the first 100 rows of data for those two variables. That would be better. But maybe your problem is not in the first 100 rows, so use seq(120,220) or whatever. – MattBagg Nov 12 '12 at 16:54
  • See also http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – MattBagg Nov 12 '12 at 16:55
  • Oooh, ok, I think I understand. I took what it reproduced, thought it was ugly, put it into a data.frame and posted that. But I suppose that doesn't make any sense. So I think I've got it now. This is just a short segment, but it contains one of the most consistently incorrect classifications, and a few correct ones. – HeidelbergSlide Nov 12 '12 at 17:20
  • You had some unwanted spaces. I added a fix to my answer. Let me know if there's a problem. – MattBagg Nov 12 '12 at 17:59
  • OH MAN! That's exactly what it was. Thanks so much for working it through with me, I doubt I ever would have caught that. – HeidelbergSlide Nov 12 '12 at 18:17