
To find the relationship between two columns of the iris dataset, I ran kruskal.test, and the p-value suggests a meaningful relationship between these two columns.

data(iris)
kruskal.test(iris$Petal.Length, iris$Sepal.Width)

Here are the results:

    Kruskal-Wallis rank sum test

data:  iris$Petal.Length and iris$Sepal.Width
Kruskal-Wallis chi-squared = 41.827, df = 22, p-value = 0.00656
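A note on reading this output: kruskal.test treats its second argument as a grouping factor, so every distinct Sepal.Width value becomes its own group (hence df = 22). For two continuous columns, a rank correlation may be closer to the intent. The following is a minimal sketch, assuming a monotonic association is what is being tested; Spearman's method is a suggestion here, not something used in the post.

```r
data(iris)

# Rank-based (Spearman) correlation between two continuous columns.
# exact = FALSE avoids the "cannot compute exact p-value with ties" warning.
res <- cor.test(iris$Petal.Length, iris$Sepal.Width,
                method = "spearman", exact = FALSE)

res$estimate  # Spearman's rho
res$p.value   # p-value for the null hypothesis rho = 0
```

Because it works on ranks, this test does not require normally distributed data, which may matter for skewed data.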

The scatter plot also shows some sort of relationship.

plot(iris$Petal.Length, iris$Petal.Width)

[Image: scatter plot of Petal.Length against Petal.Width, with a red line marking the desired boundary]

To find the meaningful boundaries of these two variables, I ran pairwise.wilcox.test, but for this test to work, one of the variables needs to be categorical. If I pass two continuous variables to it, the results are not as expected.

pairwise.wilcox.test(x = iris$Petal.Length, g = iris$Petal.Width, p.adjust.method = "BH")
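One way to give pairwise.wilcox.test the categorical grouping it needs is to bin one of the variables first. This is only a sketch: the choice of three quantile-based bins and the name width_bin are assumptions made for illustration, and the result depends on where the bins are placed.

```r
data(iris)

# Bin Petal.Width into three groups at its tertiles (an arbitrary choice).
breaks <- quantile(iris$Petal.Width, probs = seq(0, 1, length.out = 4))
width_bin <- cut(iris$Petal.Width, breaks = breaks, include.lowest = TRUE)

# Compare Petal.Length between every pair of bins, with BH adjustment.
# Ties in the data may trigger warnings about exact p-values.
res <- pairwise.wilcox.test(x = iris$Petal.Length, g = width_bin,
                            p.adjust.method = "BH")
res$p.value
```

The matrix of adjusted p-values shows which bins differ, but it does not by itself locate an optimal cut point.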

As an output, I need a clear cut point where the relationship between these two variables starts and where it ends (as shown by the red line in the image above).

I am not sure if there is any statistical test or another programming technique to find these boundaries.

For example, I can mark the boundaries manually like this:

library(data.table)  # provides setDT() and the := operator
setDT(iris)[, relationship := ifelse(Petal.Length > 3 & Sepal.Width < 3.5, 1, 0)]

But, is there a programming technique or library in R to find such boundaries?

It is important to note that my actual data is skewed.

Thanks, Saurabh

Saurabh
  • This doesn't appear to be a specific programming question that's appropriate for Stack Overflow. If you seek recommendations for statistical methods, then you should ask such questions over at [stats.se] instead. You are more likely to get better answers there. – MrFlick Sep 22 '20 at 16:58
  • An example on Stack Overflow for LDA: https://stackoverflow.com/questions/30619616/how-to-plot-classification-borders-on-an-linear-discrimination-analysis-plot-in – polkas Sep 22 '20 at 18:08
  • add this for unsupervised https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o – polkas Sep 22 '20 at 18:33

1 Answer


There is no such thing as the best split. A split can only be the best under the conditions/criteria you specify.
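To illustrate what "best under a criterion" can mean, here is a minimal sketch using a classification tree from the rpart package, which picks axis-aligned cut points that minimise Gini impurity with respect to Species. The choice of rpart and of this criterion is an assumption for illustration; a different criterion would generally give different cut points.

```r
library(rpart)
data(iris)

# Fit a classification tree; each split is a single cut point on one variable.
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris,
             method = "class")

print(fit)  # shows the chosen variables and cut points

# The first row of the splits matrix is the primary split at the root node.
root_var <- rownames(fit$splits)[1]
root_cut <- fit$splits[1, "index"]
cat("Root split:", root_var, "at", root_cut, "\n")
```

Tree splits depend only on the ordering of the values, so skewed predictors do not need to be transformed first.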

I think the second plot is what you expected, although I also added the first one, which has a single line. Both use Linear Discriminant Analysis (LDA). However, LDA is supervised learning, which works here because we have the Species column. If you have no such label column, you would need an unsupervised method (for example, clustering) instead. For drawing the decision boundaries of another classifier, k-nearest neighbours, see https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o.

data(iris)
library(MASS)

plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)

# construct the model
mdl <- lda(Species ~ Petal.Length + Petal.Width, data = iris)

# draw discrimination line
np <- 300
nd.x <- seq(from = min(iris$Petal.Length), to = max( iris$Petal.Length), length.out = np)
nd.y <- seq(from = min(iris$Petal.Width), to = max( iris$Petal.Width), length.out = np)
nd <- expand.grid(Petal.Length = nd.x, Petal.Width = nd.y)

prd <- as.numeric(predict(mdl, newdata = nd)$class)

plot(iris[, c("Petal.Length", "Petal.Width")], col = iris$Species)
# mark the three class means, coloured to match the species
points(mdl$means, pch = "+", cex = 3, col = 1:3)
contour(x = nd.x, y = nd.y, z = matrix(prd, nrow = np, ncol = np), 
        levels = c(1, 2), add = TRUE, drawlabels = FALSE)

# second plot: the same decision regions in discriminant (LD1, LD2) space
p <- predict(mdl, newdata = nd)        # LD scores and classes for the grid
ld <- predict(mdl)$x                   # LD scores of the observations

# grid points coloured by predicted class, then the data on top
plot(p$x, col = prd, pch = ".")
points(ld, pch = 21, bg = as.integer(iris$Species), col = "white")

[Images: the two plots produced by the code above]

Linked to: How to plot classification borders on an Linear Discrimination Analysis plot in R
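For comparison, the same grid approach can be used with a k-nearest-neighbour classifier, which makes no assumption about the shape of the class distributions. This is a rough sketch using knn() from the class package; the value k = 15 and the grid size are arbitrary assumptions, and the boundaries will change with both.

```r
library(class)
data(iris)

train <- iris[, c("Petal.Length", "Petal.Width")]

# Regular grid covering the observed range of both predictors.
np <- 200
gx <- seq(min(train$Petal.Length), max(train$Petal.Length), length.out = np)
gy <- seq(min(train$Petal.Width), max(train$Petal.Width), length.out = np)
grid <- expand.grid(Petal.Length = gx, Petal.Width = gy)

# Classify every grid point by majority vote among its k nearest neighbours.
pred <- knn(train = train, test = grid, cl = iris$Species, k = 15)
z <- matrix(as.integer(pred), nrow = np, ncol = np)

# Contours at 1.5 and 2.5 separate class codes 1|2 and 2|3.
plot(train, col = iris$Species)
contour(gx, gy, z, levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)
```

Like LDA, this is supervised: it needs the Species labels to draw any boundary at all.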

polkas
  • Thanks for the answer. My question is not about the visual/ graphical split of data, but about the most optimal split points that best divide the data. LDA model does suggest these split points but it does not work perfectly for skewed data. – Saurabh Sep 22 '20 at 18:47
  • KNN might be a better solution, or an SVM with a specific kernel. There is no such thing as the best split; it can only be the best under the conditions/criteria you specify. – polkas Sep 22 '20 at 19:20