Find Correlation between two columns that has range data

Question

Beginner level question.

I have data like the image above. I want to find the correlation between Height and Longevity.

Smaller breeds of dogs tend to live longer than larger breeds. Is there a way to establish this correlation and show it in plot (preferably with dog breed names as well) in R?

The cor function gives an error because the height and longevity data are stored as ranges. I am not sure exactly how this can be done. Please help.

Thank you.

Code below to reproduce:

data <- as.data.frame(structure(
  list(
    Breed = c(
      "Labrador Retriever",
      "German Shepherd",
      "Bulldog",
      "Poodle",
      "Beagle",
      "Chihuahua",
      "Boxer",
      "Golden Retriever",
      "Pug",
      "Rottweiler"
    ),
    Country.of.Origin = c(
      "Canada",
      "Germany",
      "England",
      "France",
      "England",
      "Mexico",
      "Germany",
      "Scotland",
      "China",
      "Germany"
    ),
    Fur.Color = c(
      "Yellow, Black, Chocolate",
      "Black, Tan",
      "White, Red",
      "White, Black, Brown, Apricot",
      "White, Tan, Red, Lemon",
      "Black, Brown, Tan, White",
      "Fawn, Brindle",
      "Golden",
      "Fawn, Black",
      "Black, Tan"
    ),
    Height..in. = c(
      "21-24",
      "22-26",
      "12-16",
      "10-15",
      "13-15",
      "6-9",
      "21-25",
      "21-24",
      "10-14",
      "22-27"
    ),
    Color.of.Eyes = c(
      "Brown",
      "Brown",
      "Brown",
      "Brown, Blue",
      "Brown",
      "Brown, Blue",
      "Brown",
      "Brown",
      "Brown",
      "Brown"
    ),
    Longevity..yrs. = c(
      "10-12",
      "7-10",
      "8-10",
      "12-15",
      "12-15",
      "12-20",
      "10-12",
      "10-12",
      "12-15",
      "8-10"
    ),
    Character.Traits = c(
      "Loyal, friendly, intelligent, energetic, good-natured",
      "Loyal, intelligent, protective, confident, trainable",
      "Loyal, calm, gentle, brave",
      "Intelligent, active, affectionate, hypoallergenic",
      "Curious, friendly, energetic, good-natured",
      "Loyal, energetic, confident, sensitive",
      "Loyal, energetic, intelligent, playful, protective",
      "Intelligent, friendly, kind, loyal, good-natured",
      "Loyal, playful, affectionate, social, charming",
      "Loyal, protective, confident, strong"
    ),
    common_problem1 = c(
      "hip dysplasia",
      "hip dysplasia",
      "skin allergies",
      "hip dysplasia",
      "ear infections",
      "dental problems",
      "hip dysplasia",
      "hip dysplasia",
      "eye problems",
      "hip dysplasia"
    ),
    common_problem2 = c(
      "obesity",
      "elbow dysplasia",
      "respiratory issues",
      "epilepsy",
      "hip dysplasia",
      "eye issues",
      "cancer",
      "cancer",
      "respiratory issues",
      "cancer"
    ),
    common_problem3 = c(
      "ear infections",
      "pancreatitis",
      "obesity",
      "bladder stones",
      "epilepsy",
      "respiratory issues",
      "heart conditions",
      "skin allergies",
      "obesity",
      "elbow dysplasia"
    )
  ),
  row.names = c(NA, 10L),
  class = "data.frame"
))

I tried cor(Height..in., Longevity..yrs.), but it gives me an error. I am not sure if this is the right way to do it.

Can you make your post [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and provide your data using `dput(name_of_data)` and any code you've tried so far, even if it ends in an error message? — jrcalabrese, Jan 04 '23 at 17:09
Please provide enough code so others can better understand or reproduce the problem. — Community, Jan 04 '23 at 17:12

score 2 · Accepted Answer · answered Jan 05 '23 at 13:43

Two options come to mind for your problem, though neither is optimal. Correlations can only be computed on numerical data. As far as I know, there is no way to compute a correlation directly on range data.
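As a small illustration of why `cor()` fails on the original columns, the sketch below feeds it the range strings from the question. The vector names `height` and `longevity` are placeholders chosen for this example, and `tryCatch()` is used only to capture the error message instead of stopping.

```r
# Ranges copied from the question; they are character strings, not numbers
height    <- c("21-24", "22-26", "12-16", "10-15", "13-15",
               "6-9", "21-25", "21-24", "10-14", "22-27")
longevity <- c("10-12", "7-10", "8-10", "12-15", "12-15",
               "12-20", "10-12", "10-12", "12-15", "8-10")

# cor() refuses non-numeric input; capture the error message instead of stopping
msg <- tryCatch(cor(height, longevity),
                error = function(e) conditionMessage(e))
print(msg)

# as.numeric() alone does not help: "21-24" is not a number, so it becomes NA
converted <- suppressWarnings(as.numeric(height))
print(converted)
```

So the ranges have to be turned into ranks or single numbers first, which is what the two options below do.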

Option 1: Rank correlation

Spearman's correlation or Kendall's tau can both be used to estimate the relationship between ordinal variables by using their respective ranks.
For the variable Height..in. you have 9 unique values in your dataset, whose ranges partly overlap. For the variable Longevity..yrs. you have 5 unique values, and again the ranges partly overlap. Despite the overlap, it is possible to rank the unique values, for example by lower bound and then by upper bound.

I created two ordered factors from these variables containing this information and added them to the data frame. Note that I stored the dataset in the object data so I can reference the variables with the $ operator. If your dataset has a different name, you have to adjust the code accordingly.

# Levels are listed from shortest to tallest / shortest-lived to longest-lived
data$factor_Height..in. <- factor(data$Height..in., ordered = TRUE,
                                  levels = c("6-9", "10-14", "10-15", "12-16", "13-15", "21-24", "21-25", "22-26", "22-27"),
                                  labels = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
data$factor_Longevity..yrs. <- factor(data$Longevity..yrs., ordered = TRUE,
                                      levels = c("7-10", "8-10", "10-12", "12-15", "12-20"),
                                      labels = c(1, 2, 3, 4, 5))
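If typing the levels by hand is error-prone (for example with more breeds), the ordering can also be derived from the strings. The following is only a sketch: `range_levels()` is a hypothetical helper written for this example, and it assumes every value has the form "low-high" with a plain hyphen.

```r
# Height ranges copied from the question
height <- c("21-24", "22-26", "12-16", "10-15", "13-15",
            "6-9", "21-25", "21-24", "10-14", "22-27")

# Hypothetical helper: order the unique "low-high" strings by lower bound,
# breaking ties by upper bound, and return them for use as factor levels
range_levels <- function(x) {
  u <- unique(as.character(x))
  bounds <- do.call(rbind, lapply(strsplit(u, "-", fixed = TRUE), as.numeric))
  u[order(bounds[, 1], bounds[, 2])]
}

lev <- range_levels(height)
print(lev)

# Integer rank of each breed's height range (ties share a rank)
height_rank <- as.integer(factor(height, levels = lev, ordered = TRUE))
print(height_rank)
```

For the ten breeds in the question this reproduces the level order that was typed manually above.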

These two factors can then be used to calculate Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (tau).

cor(as.numeric(data$factor_Height..in.), as.numeric(data$factor_Longevity..yrs.), method ="spearman")
cor(as.numeric(data$factor_Height..in.), as.numeric(data$factor_Longevity..yrs.), method ="kendall")
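If a p-value is wanted in addition to the coefficient, `cor.test()` accepts the same `method` argument. The sketch below hard-codes the ranks that the factor coding above assigns to the ten breeds; the names `height_rank` and `longevity_rank` are placeholders. Because the ranks contain ties, `exact = FALSE` is set so that R uses the approximate p-value without a warning. With only ten breeds the p-values should be read cautiously.

```r
# Ranks as assigned by the ordered factors above (one entry per breed, in order)
height_rank    <- c(6, 8, 4, 3, 5, 1, 7, 6, 2, 9)
longevity_rank <- c(3, 1, 2, 4, 4, 5, 3, 3, 4, 2)

# Rank correlation tests; exact = FALSE because the data contain ties
sp <- cor.test(height_rank, longevity_rank, method = "spearman", exact = FALSE)
kd <- cor.test(height_rank, longevity_rank, method = "kendall",  exact = FALSE)

print(sp$estimate)  # Spearman's rho
print(sp$p.value)
print(kd$estimate)  # Kendall's tau
print(kd$p.value)
```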

Option 2: (Mean) values instead of ranges

You could also calculate the mean longevity and mean height values and then calculate the (default) Pearson correlation coefficient.

mean_Height..in. <- sapply(strsplit(as.character(data$Height..in.) , "-", 
                                  fixed = TRUE), function(x) sum(as.numeric(x)))
mean_Height..in. <- mean_Height..in. / 2
mean_Longevity..yrs. <- sapply(strsplit(as.character(data$Longevity..yrs.) , "-", 
                                    fixed = TRUE), function(x) sum(as.numeric(x)))
mean_Longevity..yrs. <- mean_Longevity..yrs. / 2
cor(mean_Height..in., mean_Longevity..yrs.)
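The same midpoint calculation can be wrapped in a small function so it is written only once. This is a sketch, not part of the original answer: `range_midpoint()` is a hypothetical helper, and it assumes each value is "low-high" separated by a plain hyphen with no negative numbers.

```r
# Hypothetical helper: convert "low-high" strings to their numeric midpoints
range_midpoint <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

# Ranges copied from the question
height    <- c("21-24", "22-26", "12-16", "10-15", "13-15",
               "6-9", "21-25", "21-24", "10-14", "22-27")
longevity <- c("10-12", "7-10", "8-10", "12-15", "12-15",
               "12-20", "10-12", "10-12", "12-15", "8-10")

mid_height    <- range_midpoint(height)
mid_longevity <- range_midpoint(longevity)
print(mid_height)
print(mid_longevity)

# Pearson correlation of the midpoints
r <- cor(mid_height, mid_longevity)
print(r)
```

Using `vapply()` with `numeric(1)` makes the function fail loudly if an entry does not reduce to a single number, which `sapply()` would silently tolerate.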

Spearman's and Kendall's correlation coefficients of the ranked values and the Pearson correlation of the averaged ranges are all negative. So, as you expected, your data suggest that the larger the dog, the shorter the lifespan.

Plot the data

A simple scatter plot can be used to display the relationship. Again, the ranges cannot be used, so we use the ranks instead.

plot(as.numeric(data$factor_Longevity..yrs.), as.numeric(data$factor_Height..in.))
text(as.numeric(data$factor_Longevity..yrs.), as.numeric(data$factor_Height..in.), labels = data$Breed)
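As an alternative to plotting ranks, the range midpoints can be plotted directly, which keeps the axes in inches and years. The sketch below is self-contained and uses base graphics only; the data frame name `dogs`, the helper `mid()` and the column names are assumptions made for this example. Height is placed on the x-axis here, and a dashed least-squares line is added to show the trend.

```r
# Breed names and ranges copied from the question
dogs <- data.frame(
  Breed = c("Labrador Retriever", "German Shepherd", "Bulldog", "Poodle",
            "Beagle", "Chihuahua", "Boxer", "Golden Retriever", "Pug",
            "Rottweiler"),
  Height = c("21-24", "22-26", "12-16", "10-15", "13-15",
             "6-9", "21-25", "21-24", "10-14", "22-27"),
  Longevity = c("10-12", "7-10", "8-10", "12-15", "12-15",
                "12-20", "10-12", "10-12", "12-15", "8-10")
)

# Hypothetical helper: midpoint of each "low-high" range
mid <- function(x) {
  vapply(strsplit(as.character(x), "-", fixed = TRUE),
         function(p) mean(as.numeric(p)), numeric(1))
}
dogs$mid_height    <- mid(dogs$Height)
dogs$mid_longevity <- mid(dogs$Longevity)

# Scatter plot with extra room on the right for the labels
plot(dogs$mid_height, dogs$mid_longevity, pch = 19,
     xlim = range(dogs$mid_height) + c(-2, 6),
     xlab = "Height (in, range midpoint)",
     ylab = "Longevity (yrs, range midpoint)")
text(dogs$mid_height, dogs$mid_longevity, labels = dogs$Breed,
     pos = 4, cex = 0.7)

# Least-squares trend line
fit <- lm(mid_longevity ~ mid_height, data = dogs)
abline(fit, lty = 2)
```

Labrador Retriever and Golden Retriever have identical ranges, so their points and labels coincide; nudging one label (for example with `pos = 2`) would be needed to read both.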

Hope that helps!

Thank you for this amazing guide and clarifications. Everything worked fine except the plot. Am getting error as: "Error in plot.window(...) : need finite 'xlim' values In addition: Warning messages: 1: In min(x) : no non-missing arguments to min; returning Inf 2: In max(x) : no non-missing arguments to max; returning -Inf 3: In min(x) : no non-missing arguments to min; returning Inf 4: In max(x) : no non-missing arguments to max; returning -Inf" I guess this is happening because the correlation is giving N/A values for all. — Shashivydyula, Jan 05 '23 at 15:15
Maybe try using the mean of the averaged values for the plot: `data$mean_Height..in. <- sapply(strsplit(as.character(data$Height..in.), "-", fixed = TRUE), function(x) sum(as.numeric(x)))`; `data$mean_Height..in. <- data$mean_Height..in. / 2`; `data$mean_Longevity..yrs. <- sapply(strsplit(as.character(data$Longevity..yrs.), "-", fixed = TRUE), function(x) sum(as.numeric(x)))`; `data$mean_Longevity..yrs. <- data$mean_Longevity..yrs. / 2`; `plot(data$mean_Longevity..yrs., data$mean_Height..in.)`; `text(data$mean_Longevity..yrs., data$mean_Height..in., data$Breed)` — Maria-Christina Weber, Jan 05 '23 at 15:50
To clarify: To use the mean of the averaged ranges, these have to be stored in the dataset, so that the Breed names can be associated with the values. Therefore, you have to assign `mean_Height..in.` and `mean_Longevity..yrs.` to your dataset by using the `$` operator. Then you should be able to use the plot function, `plot(data$mean_Longevity..yrs., data$mean_Height..in.)`, and the text function to add the labels, `text(data$mean_Longevity..yrs., data$mean_Height..in., data$Breed)`. Glad I could help! — Maria-Christina Weber, Jan 05 '23 at 16:04
Thank you so much. This worked as expected. All I need to do is certain filtering and cleaning. But this is what I was looking for. Thank you once again for helping a newbie. Much appreciated — Shashivydyula, Jan 05 '23 at 16:18
