I have a data frame:

# Create an empty dataframe to store the results
result_df <- data.frame()

test_data <- as.data.frame(list(V1 = c(6, 15, 14, 13, 5), 
                                V2 = c(14, 18, 5, 2, 16), 
                                V3 = c(11, 20, 19, 18, 10), 
                                V4 = c(22, 22, 11, 18, 15), 
                                V5 = c(-8, -12, 1, 4, -10)))

Then I compute the correlation for every pair of columns:

# Loop through each column
for (i in 1:(ncol(test_data) - 1)) {
  for (j in (i + 1):ncol(test_data)) {
    # Compute the correlation between columns i and j
    result <- cor(test_data[, i], test_data[, j])
    
    # Append the result to the result dataframe
    result_df <- rbind(result_df, 
                       data.frame(Column1 = names(test_data)[i], 
                                  Column2 = names(test_data)[j], 
                                  Result = result))  
  }
}

# Take the absolute value of each correlation
result_df$Result <- abs(result_df$Result)

Then I need to subset the result to the rows where the absolute correlation is 1. For some reason the subset contains only one row, although there are two such rows.

res <- result_df[result_df$Result==1,]

When I use `isTRUE(all.equal(result_df$Result, 1))` instead, the result is even worse: the subset contains no rows at all.

# res <- result_df[isTRUE(all.equal(result_df$Result, 1)),]

Why?

Lara
  • 2
    [1/2] Iteratively adding rows to a frame using `rbind(old, newrow)` works in practice but scales *horribly*, see "Growing Objects" in [The R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf). For each row added, it makes a complete copy of all rows in `old`, which works but starts to slow down a lot. It is far better to produce a list of these new rows and then `rbind` them at one time; e.g., `out <- list(); for (...) { out <- c(out, list(newrow)); }; alldat <- do.call(rbind, out);`. – r2evans Jun 08 '23 at 02:46
  • 2
    [2/2] The assumption that a floating-point can be matched with strict equality is a luxury that works sometimes and will silently fail. See https://stackoverflow.com/q/9508518/3358272, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f. – r2evans Jun 08 '23 at 02:47
  • 2
    (`== 1` with a `numeric` is not sustainable. Find a tolerance of difference from 1, as in `abs(Result - 1) < 1e-9` or similar. Oh, and stop doing iterative `rbind`, it's bad practice that works fine for so many rows but scales _horribly_.) – r2evans Jun 08 '23 at 02:49
  • 3
    Consider that `cor` accepts matrix and data.frame objects. So just `cor(test_data)` can replace your double for loop, and it's much more efficient. – nicola Jun 08 '23 at 04:45
  • 1
    Regarding your equality, `print(result_df, digits=22)` clearly shows why the subsetting doesn't work as you expect. – nicola Jun 08 '23 at 04:51
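
A minimal sketch of what the comments above suggest, using the question's own data: the rows are collected in a list and bound once, and the subset uses a tolerance instead of `== 1`. The names `rows`, `tol` and the value `1e-9` are illustrative assumptions, not part of the original code. The last line also hints at why the `isTRUE(all.equal(...))` attempt returns nothing: `isTRUE()` always yields a single `TRUE` or `FALSE` for the whole vector, so it cannot select individual rows.

```r
test_data <- data.frame(V1 = c(6, 15, 14, 13, 5),
                        V2 = c(14, 18, 5, 2, 16),
                        V3 = c(11, 20, 19, 18, 10),
                        V4 = c(22, 22, 11, 18, 15),
                        V5 = c(-8, -12, 1, 4, -10))

# Collect one small data frame per column pair in a list, then bind once
# (avoids copying the growing result on every iteration).
rows <- list()
for (i in 1:(ncol(test_data) - 1)) {
  for (j in (i + 1):ncol(test_data)) {
    rows[[length(rows) + 1]] <- data.frame(
      Column1 = names(test_data)[i],
      Column2 = names(test_data)[j],
      Result  = abs(cor(test_data[, i], test_data[, j]))
    )
  }
}
result_df <- do.call(rbind, rows)

# Element-wise tolerance test: one TRUE/FALSE per row.
tol <- 1e-9  # assumed tolerance; adjust to your needs
res <- result_df[abs(result_df$Result - 1) < tol, ]
print(res)

# isTRUE() collapses the comparison to a single logical (FALSE here),
# and indexing rows with a single FALSE selects nothing.
isTRUE(all.equal(result_df$Result, 1))
```

With this data, `res` should contain both pairs (`V1`/`V3` and `V2`/`V5`), because `V3` is `V1 + 5` and `V5` is `6 - V2`.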

1 Answer

From your code, I infer that you need to:

  • get the absolute values of the correlation matrix;
  • get either the full correlation matrix or only half of it (the lower triangle, excluding the diagonal).

So I wrote code with two possible answers for your reference.

library(dplyr)   # mutate(), filter(), arrange()
library(tidyr)   # pivot_longer()
library(tibble)  # as_tibble()

# compute the full correlation matrix
corr_matrix <- cor(test_data)


# =====answer 1 - assuming you need the full corr matrix ==================

# convert it to a long table in the format you described
result_full_matrix <- corr_matrix |>
  as_tibble(rownames = 'column1') |>
  pivot_longer(cols = -c('column1'), names_to = 'column2', values_to = 'result') |>
  mutate(result = abs(result)) |>
  # sort by the column1 and column2 ids
  arrange(column1, column2)
print(result_full_matrix)



# =====answer 2 - assuming you need only the lower triangle of the corr matrix =======
# mark the upper triangle and the diagonal as NA
corr_matrix[upper.tri(corr_matrix, diag = TRUE)] <- NA

# convert it to a long table in the format you described
result_triangle_matrix <- corr_matrix |>
  as_tibble(rownames = 'column1') |>
  pivot_longer(cols = -c('column1'), names_to = 'column2', values_to = 'result') |>
  mutate(result = abs(result)) |>
  # drop the upper-triangle and diagonal elements marked as NA in the previous step
  filter(!is.na(result)) |>
  # sort by the column1 and column2 ids
  arrange(column1, column2)
print(result_triangle_matrix)
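
As a possible follow-up, here is a base-R sketch (no packages) that builds the same lower-triangle table from `cor(test_data)` and then keeps only the pairs whose absolute correlation is 1 within a tolerance, which is what the question is ultimately after. The names `idx`, `pairs`, `perfect` and the tolerance `1e-9` are assumptions made for illustration.

```r
test_data <- data.frame(V1 = c(6, 15, 14, 13, 5),
                        V2 = c(14, 18, 5, 2, 16),
                        V3 = c(11, 20, 19, 18, 10),
                        V4 = c(22, 22, 11, 18, 15),
                        V5 = c(-8, -12, 1, 4, -10))

# Absolute correlation matrix for all columns at once.
corr_matrix <- abs(cor(test_data))

# Row/column positions of the lower triangle (diagonal excluded).
idx <- which(lower.tri(corr_matrix), arr.ind = TRUE)

# Long table with one row per column pair.
pairs <- data.frame(
  column1 = rownames(corr_matrix)[idx[, "row"]],
  column2 = colnames(corr_matrix)[idx[, "col"]],
  result  = corr_matrix[idx]
)

# Keep pairs whose absolute correlation equals 1 up to a tolerance,
# rather than testing floating-point values with `==`.
perfect <- pairs[abs(pairs$result - 1) < 1e-9, ]
print(perfect)
```

Because the lower triangle is used, each pair appears once with the later column in `column1` (for example `V3` paired with `V1`).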

If you find this useful, please vote it up.

Yong Wang