0

This problem confused me for a while and I'll try to describe as clearly as possible:

I would like to calculate correlations of sales between each pair of hotels, and here is my data:

hotel = c("A","A","B","C","C","D") 
Staydate = c(1:6)
cbind(hotel, Staydate, sales)

Each hotel might have multiple sales because the staydate is different. I would like to create a matrix calculating the correlation between hotels.

Shall I break the data.frame down to each hotel first? Or is there any method I can aggregate by hotel first, then calculate correlation?

My expected results would be the correlations between each pair of hotels. A matrix visualization would be very helpful! For example, in the following image, change sepal length, sepal width, petal length, and petal width into hotel A, B, C, D.

enter image description here

Thanks!!

s_baldur
  • 29,441
  • 4
  • 36
  • 69
Rachel Zhang
  • 562
  • 6
  • 20

2 Answers2

0

If you're asking how to aggregate then this is a duplicate of R - Aggregate (sum) totals for grouped data using dplyr

require(dplyr) # For the group_by and summarise functions
require(magrittr) # To pipe all this together
hotel = c("A","A","B","C","C","D")
Staydate = c(1:6)
sales = c(209,191,225,172,189,214)

# Store it as a data frame in order to perform aggregation
df <- data.frame(hotel, Staydate, sales)
df[,c("Staydate","sales")] <- as.numeric(as.factor(unlist(df[,c("Staydate","sales")]))) # To convert the columns from factors to numeric

df_agg <- df %>% group_by(hotel) %>% summarise(Avg_sales = mean(sales))

Output is an aggregated data frame

# A tibble: 4 x 2
hotel Avg_sales
<fct>     <dbl>
1 A           9.5
2 B          12  
3 C           7.5
4 D          11  

Just convert back to a matrix to perform the correlation

matrix <- as.matrix(df_agg)
TheSciGuy
  • 1,154
  • 11
  • 22
  • Hi Nick, thanks for your answer! Yes, part of my question is how to aggregate my data since I want to calculate the correlation between each pair of hotels. You were using group_by and summarize function to get the average sales for each hotel, but I'm wondering is it possible to keep the original sales, then get the correlation? For instance, hotel A might have 4 observations in sales, and I would like to calculate its correlation with the 4 sales obs with hotel B. Thanks!! – Rachel Zhang Jul 16 '18 at 13:23
  • Hi Rachel, you'll want to restructure your data frame from long format to wide format. This way you'll have columns for A, B, C, and D, which will contain sales in each column. If you don't care about Staydate in the matrix, I would subset it out and then run your correlations with a matrix consisting of only the columns A, B, C, and D. https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format. P.S. If the answers provided helped you, make sure to accept one of them. – TheSciGuy Jul 16 '18 at 17:50
0

The code below is just a repetition of the code in the Cross Validated answer to this question.

First of all, you have created your dataset with cbind. don't do that. As you can see in the image all variables became of class character. This is because cbind outputs a matrix, what you want is a data.frame.

hotel = c("A","A","B","C","C","D")
Staydate = c(1:6)
sales <- c(209.134, 191.832, 225.115, 172.442, 189.457, 214.521)

dat <- data.frame(hotel, Staydate, sales)

Now for the correlation.

model <- lm(sales ~ hotel, dat)

rsq <- summary(model)$r.squared
rsq
sqrt(rsq)
#[1] 0.9175044

cor(dat$sales, model$fitted)
#[1] 0.9175044

Like it's explained in the CV question of the link above, since you are trying to correlate a continuous variable and a categorical one, the steps are:

  1. Fit a linear model regressing the continuous variable, sales on the factor, hotel.
  2. Then you can compute the correlation in two ways.
    • Take the square root of R².
    • Compute the correlation between the sales and the fitted values of the model.
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Hi Rui, thank you for your answer! I might not express my question properly, but I would like to know what's exactly the correlation of sales between hotels. Say the correlation of sales between A and B is 0.9 and between B and C is 0.8 something like that. Thanks! – Rachel Zhang Jul 16 '18 at 12:41
  • @RachelZhang I don't believe the question posed like that makes much sense. Can you give a reference of how to compute such a correlation? – Rui Barradas Jul 16 '18 at 14:48
  • There's a problem with that, your dataset is in long format. If I [reshape data from long to wide in R](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) there will be lots of `NA` values and `cor` will be a matrix of `NA`'s. – Rui Barradas Jul 16 '18 at 20:21