I would like to identify all non-overlapping values between groups (factors) in a data frame. Let's use iris to illustrate. The iris dataset has measurements of sepal length, sepal width, petal length, and petal width for three plant species (setosa, versicolor, and virginica). All three species overlap in sepal length and sepal width. In both petal length and petal width, setosa overlaps with neither versicolor nor virginica.
This can be checked by hand, for example by comparing per-species ranges or by looking at scatter plots:
tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)
# or
library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()
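The range check above can also be run over every numeric column in one call. This is only a minimal sketch: `group_ranges` is a hypothetical helper name, and it assumes every measurement column is numeric and that `Species` is the grouping factor.

```r
# Hypothetical helper: per-group min and max of one numeric vector,
# returned as a 2 x k matrix (rows "min"/"max", one column per group).
group_ranges <- function(x, g) {
  r <- sapply(split(x, g), range)
  rownames(r) <- c("min", "max")
  r
}

# Apply it to every numeric column of iris
num_cols <- names(iris)[sapply(iris, is.numeric)]
ranges <- lapply(iris[num_cols], group_ranges, g = iris$Species)

ranges$Petal.Length
```

This still leaves the pairwise comparison of ranges to be done by eye, which is the part I would like to automate.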
But doing this by hand is impractical for large datasets, so I'd like to write a function that identifies non-overlapping values between factor levels in data frames like iris. The output could be a list of logical matrices, one per variable, with TRUE indicating non-overlap and FALSE indicating overlap. For example, the output for iris would be a list of 4 matrices:
$1.Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$2.Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$3.Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$4.Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
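To make the intended output concrete, here is a rough sketch of the kind of function I have in mind. `non_overlap` is a hypothetical name, not from any package, and the sketch makes two assumptions: "non-overlap" means the two groups' ranges are strictly disjoint (ranges that merely touch count as overlapping), and only numeric columns are checked. It returns the list names without the numeric prefixes shown above.

```r
# Hypothetical helper (an assumption, not an existing function):
# for each numeric column of `data`, build a group-by-group logical matrix,
# TRUE where the two groups' ranges are disjoint, FALSE where they overlap,
# and NA on the diagonal.
non_overlap <- function(data, group) {
  g <- factor(data[[group]])  # re-factor to drop unused levels
  num_cols <- setdiff(names(data)[sapply(data, is.numeric)], group)
  lapply(data[num_cols], function(x) {
    lo <- sapply(split(x, g), min, na.rm = TRUE)
    hi <- sapply(split(x, g), max, na.rm = TRUE)
    # m[i, j] is TRUE when group i ends strictly before group j begins
    m <- outer(hi, lo, "<")
    # Disjoint in either direction
    m <- m | t(m)
    diag(m) <- NA
    m
  })
}

res <- non_overlap(iris, "Species")
res$Petal.Length
```

Comparing ranges is a simplification: two groups whose ranges overlap could still share no actual values, which this sketch would not detect.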
I'm open to other output formats, as long as they identify all non-overlapping values.