
I would like to identify all non-overlapping values between groups (factor levels) in a data frame. Let's use iris to illustrate. The iris dataset has measurements of sepal length, sepal width, petal length, and petal width for three plant species (setosa, versicolor, and virginica). All three species overlap in their measurements of sepal length and sepal width. For both petal length and petal width, setosa overlaps with neither versicolor nor virginica, while versicolor and virginica overlap with each other.

What I want can easily be checked manually, for example by comparing the range of each variable per species or by plotting the values:

tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)

# or

library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()

But it's impractical to do this manually for large datasets, so I'd like to write a function that identifies non-overlapping values between factors in dataframes like iris. The output could be a list of matrices with TRUE or FALSE (indicating non-overlap and overlap, respectively), one for each variable in the dataset. For example, the output for iris would be a list of 4 matrices:

$1.Sepal.Length
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$2.Sepal.Width
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$3.Petal.Length
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

$4.Petal.Width
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

I accept suggestions of different outputs, as long as they identify all non-overlapping values.

goshawk

1 Answer


This is one possible solution using the tidyverse:

library(dplyr)

# build custom function
my_fun <- function(x){
    # build tibble from input data (column with metric) and the Species vector from iris
    myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
        # find min and max value per species
        dplyr::group_by(Species) %>%
        dplyr::summarise(mini = min(Vals), maxi = max(Vals)) 

    ret <- myDf %>%
        # build full join from data
        dplyr::full_join(myDf, by = character(), suffix = c("_1", "_2")) %>% 
        # convert operation to row wise
        dplyr::rowwise() %>% 
        # if the species are the same, return NA; otherwise check whether either range contains an endpoint of the other
        # the result is negated because overlapping ranges should give FALSE
        dplyr::mutate(res = ifelse(Species_1 == Species_2, NA, !(dplyr::between(mini_1, mini_2, maxi_2) | dplyr::between(maxi_1, mini_2, maxi_2) | dplyr::between(mini_2, mini_1, maxi_1) | dplyr::between(maxi_2, mini_1, maxi_1) ))) %>%
        # make tibble wide to get the wanted layout
        tidyr::pivot_wider(-c(mini_1, maxi_1, mini_2, maxi_2), names_from = Species_2, values_from = res) %>%
        # need it to be able to set row names
        as.data.frame()

    # set row names from column
    row.names(ret) <- ret$Species_1
    # remove column used to name rows
    ret$Species_1 <- NULL
    return(ret)
}

purrr::map(iris[, 1:4], ~my_fun(.x))

$Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
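
The same range comparison can also be sketched in base R. This is a minimal sketch, not a drop-in replacement for `my_fun`: the helper name `non_overlap` and its arguments `groups` and `na.rm` are hypothetical, and it assumes that two groups count as non-overlapping when one group's maximum is strictly below the other group's minimum.

```r
# Hypothetical helper (name and arguments are assumptions for illustration).
# Returns a logical matrix: TRUE where two groups' ranges do not overlap,
# FALSE where they do, and NA on the diagonal.
non_overlap <- function(x, groups, na.rm = TRUE) {
  groups <- droplevels(as.factor(groups))
  # per-group minimum and maximum as named numeric vectors
  mins <- vapply(split(x, groups), min, numeric(1), na.rm = na.rm)
  maxs <- vapply(split(x, groups), max, numeric(1), na.rm = na.rm)
  # group i and group j are disjoint when i ends before j begins,
  # or i begins after j ends
  res <- outer(maxs, mins, "<") | outer(mins, maxs, ">")
  diag(res) <- NA
  res
}

# apply to the four measurement columns of iris
res_base <- lapply(iris[1:4], non_overlap, groups = iris$Species)
res_base$Petal.Length
```

Because the grouping vector is passed as an argument instead of being read from `iris` inside the function, the sketch can be reused on other data frames.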
DPH
  • Thanks for your response! I don't understand the `tidyverse` syntax, and so I'd prefer to do this work using the R base package. But your suggestion may resolve my problem for now. I tried to run your code, but it returned the following error in the last line: `Error in tidyr::pivot_wider(-c(mini_1, maxi_1, mini_2, maxi_2), names_from = Species_2, : object 'mini_1' not found` – goshawk Feb 16 '23 at 00:27
  • @goshawk there was a pipe (%>%) missing in my code... I just edited the answer accordingly and it should work now. Concerning a base R solution, I will have to let you wait, as I am very used to the tidyverse for data wrangling but not so much to base R. – DPH Feb 16 '23 at 01:22
  • many thanks! This will work for now. I'll try to figure out how to write this function using base R just to understand more fully what the code is doing. – goshawk Feb 16 '23 at 13:08
  • There's just one more problem: when I run the code on datasets that contain `NA` values, the output has `NA` in the entire row and column of any species that contains `NA` values. How could I resolve this? That is, I want the code to "ignore" NA values when calculating ranges. – goshawk Feb 16 '23 at 16:39
  • Resolved - just use `na.rm = TRUE` within `min(Vals)` and `max(Vals)` – goshawk Feb 16 '23 at 17:02
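
The effect of `na.rm = TRUE` described in the last comment can be illustrated with a small base R sketch. The copy `iris_na` and the injected missing value are assumptions made only for this illustration.

```r
# Copy of iris with one missing value injected (illustration only)
iris_na <- iris
iris_na$Petal.Length[1] <- NA

# Without na.rm, a single NA makes the whole group's minimum NA,
# which then propagates into every comparison involving that group
min_with_na <- tapply(iris_na$Petal.Length, iris_na$Species, min)

# With na.rm = TRUE the missing value is skipped
min_without_na <- tapply(iris_na$Petal.Length, iris_na$Species, min, na.rm = TRUE)

min_with_na
min_without_na
```

The same argument applies to `max()`, so both calls inside `summarise()` need it.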