Convert each row of a data frame into a vector with c() in R, then use the vectors to calculate the cosine similarity

Question

Hello, I have a very large data frame; this is a small part of it:

v1 <- c('i1', 'i10', 'i11')
v2 <- c(0.11, 0.07, 0.114)
v3 <- c(0.07, 0.08, 0.03)
df <- data.frame(cbind(v1, v2, v3))

How can I write code to convert each row into a vector, as in x <- c()?

That is, my expected output should look like this, with the variable names taken from column v1:

i1 <- c(0.11014318, 0.07302843, 0.01360761, 0.10619829, 0.14513045)
i10 <- c(0.07360007, 0.08013833, 0.13104657, 0.13174247, 0.14256615)
i11 <- c(0.11418245, 0.03300573, 0.11425297, 0.13686428, 0.03367279)

After converting each row into a vector, I need to compute the cosine similarity among these vectors. That is why I need to split each row and save it as a vector named after its value in the first column, v1.

library(lsa)
cosine(i1, i10)
cosine(i1, i11)
cosine(i10, i11)

Follow-up question

Hello SamR. Thanks for your kind help, but I do not know why it does not work when I add two more columns, v4 and v5, and one more row with the ID i12. Thanks so much for your patience and help.

data_matrix <- function(df){
  data_matrix  <- tail(t(df), -1) |>
    sapply(as.numeric) |>
    matrix(
        nrow = ncol(df)-1, 
        ncol = nrow(df), 
        dimnames = list(
            seq_len(nrow(df)-1), # rows
            df[,1] # columns
        )
    ) 
}

v1 <- c('i1', 'i10', 'i11', 'i12')
v2 <- c(0.11, 0.07, 0.114, 0.67)
v3 <- c(0.07, 0.08, 0.03, 0.87)
v4 <- c(0.12, 0.13, 0.14, 0.18)
v5 <- c(0.19, 0.21, 0.22, 0.22)
df <- data.frame(cbind(v1, v2, v3, v4, v5))
df

data_matrix(df)

It just returns the error:

Error in matrix(sapply(tail(t(df), -1), as.numeric), nrow = ncol(df) -  : 
  length of 'dimnames' [1] not equal to array extent

The idea behind a data frame is to organize similar items with the same properties. The bigger your data, the more you profit from not splitting rows apart. – danlooo, May 03 '22 at 07:35
Thanks for your comment but I need to calculate the cosine similarities for every vector. — Fox_Summer, May 03 '22 at 14:43

Maël · Answer 1 · 2022-05-03T07:48:33.997

2

You can use split to split the rows, with setNames to name the list elements after your first column, and then use list2env to add the elements of the list to the global environment:

l <- setNames(split(df[-1], seq(nrow(df))), df[,1])

# $i1
#     v2   v3
# 1 0.11 0.07
# 
# $i10
#     v2   v3
# 2 0.07 0.08
# 
# $i11
#      v2   v3
# 3 0.114 0.03

list2env(l, .GlobalEnv)

Other splitting options include asplit, split with row, and transposing with t:

asplit(df[-1], 1)
split(df[-1], row(df[-1])[, 1])
as.list(as.data.frame(t(df[, -1])))
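One caveat with the question's `df`: because it is built with `data.frame(cbind(...))`, every column is character, so the pieces produced by splitting hold strings rather than numbers. A minimal sketch, assuming that same `df`, of turning each row into a plain numeric vector before exporting it (the name `vecs` is only illustrative):

```r
# Rebuild the question's data frame; cbind() on mixed types yields character columns
v1 <- c('i1', 'i10', 'i11')
v2 <- c(0.11, 0.07, 0.114)
v3 <- c(0.07, 0.08, 0.03)
df <- data.frame(cbind(v1, v2, v3))

# Split the non-ID columns row by row, then convert each piece to a numeric vector
rows <- split(df[-1], seq(nrow(df)))
vecs <- lapply(rows, function(r) as.numeric(unlist(r)))
names(vecs) <- df[, 1]

vecs$i10
```

Keeping the vectors in a named list like this also makes it possible to loop over pairs without first exporting anything to the global environment.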

edited May 03 '22 at 07:48

answered May 03 '22 at 07:33


`list2env(l, .GlobalEnv)` to build objects to the global environment. – Darren Tsai May 03 '22 at 07:41
When I use this function, it by default builds it to the global env, but better be specific. Good input – Maël May 03 '22 at 07:43
Good point! `list2env` of my R version(4.1.2) defaults to build objects to a new environment, not global. – Darren Tsai May 03 '22 at 07:51
Nice answer, +1! We can also use `list2env(by(df, v1, `[`, -1), .GlobalEnv)` if we prefer code-golfing. – ThomasIsCoding May 03 '22 at 08:17

SamR · Accepted Answer · 2022-05-03T17:51:20.000

1

Another approach would be to use apply over each row, which allows you to set the environment directly:

apply(df, 1, function(x) assign(x[1], tail(x, -1), envir = globalenv()))

However I agree with @danlooo's comment: I can't think of any reason that you would want to do this.

Edit: how to calculate cosine similarity matrix (following comment)

If you want to calculate a cosine similarity matrix it's better to start off with a matrix than to clutter up your global environment, and then have to do a potentially large combination of pairwise calculations.

First get the data into the right format, a numeric matrix with column names which are the first column of your data frame:

data_matrix  <- tail(t(df), -1) |>
    sapply(as.numeric) |>
    matrix(
        nrow = ncol(df) - 1, 
        ncol = nrow(df), 
        dimnames = list(
            seq_len(ncol(df)-1), # rows
            df[,1] # columns
        )
    ) 

data_matrix
#     i1  i10   i11
# 1 0.11 0.07 0.114
# 2 0.07 0.08 0.030

Then it is straightforward to calculate the cosine similarity:


library(lsa)
cosine(data_matrix)

#            i1       i10       i11
# i1  1.0000000 0.9595950 0.9525148
# i10 0.9595950 1.0000000 0.8283488
# i11 0.9525148 0.8283488 1.0000000
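For reference, the same numbers can be reproduced in base R without `lsa`. The sketch below relies on `cosine()` comparing the columns of a matrix, as the output above shows; the helper name `cosine_cols` is made up for illustration and the matrix is typed in by hand from the three-row example.

```r
# Hypothetical helper: pairwise cosine similarity between the columns of a numeric matrix
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))          # Euclidean length of each column
  crossprod(m) / outer(norms, norms)   # dot products divided by the products of lengths
}

# The three-row example, one column per ID
m <- matrix(
  c(0.11, 0.07, 0.07, 0.08, 0.114, 0.03),
  nrow = 2,
  dimnames = list(NULL, c("i1", "i10", "i11"))
)

round(cosine_cols(m), 7)
```

Here `crossprod(m)` computes every pairwise dot product in one step, so no loop over pairs of vectors is needed.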

edited May 03 '22 at 17:51

answered May 03 '22 at 07:38


Thanks! The reason why I want to do it is that I need to calculate the cosine similarity among them. – Fox_Summer May 03 '22 at 14:47
@Fox_Summer Now that I know what you are trying to do, I have updated my response with a better way to do it! – SamR May 03 '22 at 15:28
Hello SamR, I have another question and please see my update above. Why does data_matrix not work when I add two more columns and one row? Thanks so much for your help and patience. – Fox_Summer May 03 '22 at 17:34
I made a mistake in the `dimnames` argument. It should be `ncol(df)-1` not `nrow(df)-1`. I have updated the answer. I didn't notice because it had the same number of rows as columns. Also - I am assuming you are going to do something else in the function, or return `data_matrix`? Otherwise the variable disappears when the function ends. – SamR May 03 '22 at 17:53
Yes and I am going to return ```data_matrix``` since I have different two datasets. I just do not want to repeat the chunk of code. :) Again, thanks so much. – Fox_Summer May 03 '22 at 18:53
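Putting the fix and the last two comments together, a sketch of the helper with the corrected `dimnames` and an explicit return value might look like this. The name `make_data_matrix` is chosen here only to avoid reusing `data_matrix` for both the function and its result, and `0.87` for the last `v3` value is an assumption about what the question intended. It needs R 4.1 or later for the `|>` pipe.

```r
# Hypothetical wrapper around the corrected code from the answer
make_data_matrix <- function(df) {
  m <- tail(t(df), -1) |>
    sapply(as.numeric) |>
    matrix(
      nrow = ncol(df) - 1,
      ncol = nrow(df),
      dimnames = list(
        seq_len(ncol(df) - 1), # rows: one per value column
        df[, 1]                # columns: IDs from the first column
      )
    )
  m # return the matrix so it survives after the function ends
}

v1 <- c('i1', 'i10', 'i11', 'i12')
v2 <- c(0.11, 0.07, 0.114, 0.67)
v3 <- c(0.07, 0.08, 0.03, 0.87) # 0.87 is assumed; the question shows 087
v4 <- c(0.12, 0.13, 0.14, 0.18)
v5 <- c(0.19, 0.21, 0.22, 0.22)
df <- data.frame(cbind(v1, v2, v3, v4, v5))

m <- make_data_matrix(df)
dim(m)
```

The result has one row per value column and one column per ID, so it can be passed straight to `cosine()` for each dataset.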

iago · Answer 3 · 2022-05-03T09:12:58.043

Another variation of previous answers:

lapply(seq_len(nrow(df)), \(.) assign(df$v1[.], unlist(df[.,-1]), envir = .GlobalEnv))

That is, for each (lapply) row (seq_len(nrow(df)), \(.)), convert all the columns except the first into a vector (unlist(df[.,-1])), and then assign that vector to the name given in the first column (assign(df$v1[.], ...)) in the global environment (envir = .GlobalEnv).

A faster option, which also improves on @SamR's solution (where converting the df to an array turns all numeric data into character):

list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv)

But it is not faster than @Maël's solutions:

v1 <- paste0("i", 1:1e+3)
lapply(2:200, \(.) assign(paste0("v", .), rnorm(1e+3), envir = .GlobalEnv))
df <- do.call("data.frame", args = sapply(ls(pattern = "^v\\d+$"), get, envir = .GlobalEnv, simplify = FALSE))
microbenchmark::microbenchmark(
    list2env(setNames(as.list(as.data.frame(t(df[, -1]))), df[, 1]), .GlobalEnv), 
    list2env(setNames(asplit(df[-1], 1), df[, 1]), .GlobalEnv), 
    list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv), 
    check = "equal")
Unit: milliseconds
                                                                                          expr      min       lq     mean   median       uq      max neval
             list2env(setNames(as.list(as.data.frame(t(df[, -1]))), df[, 1]), .GlobalEnv) 5.548269 5.731607 9.444446 5.864418 6.114002 37.83762   100
                               list2env(setNames(asplit(df[-1], 1), df[, 1]), .GlobalEnv) 7.421431 7.568999 9.336666 7.639897 7.800458 31.90791   100
 list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv) 8.031275 8.201781 9.796997 8.332828 8.512478 34.35403   100

The other solutions by @Maël (using split(df[-1], seq(nrow(df))) and split(df[-1], row(df[-1])[, 1])) and the solution by @benson23 setNames(lapply(1:nrow(df), function(x) df[x, -1]), df[, 1]) produce data.frame outputs instead of vectors.
