
Data
We have numerous text strings that look like this (much longer in our real dataset):

df <- data.frame(
  id   = c('text1', 'text2', 'text3'),
  text = c('ABA', 'ABA', 'AAA')
)


>df
     id text
1 text1  ABA
2 text2  ABA
3 text3  AAA

We want to create a matrix that counts how often the letter at a given position occurs together with each letter at the other positions, so in this case:

    1A  2A  2B  3A
1A   3   1   2   3
2A   1   1   0   1
2B   2   0   2   2
3A   3   1   2   3

Here a label like `1A` means "an A at position 1"; for example, the entry at row `2B`, column `1A` is 2 because two of the three strings have a B at position 2 together with an A at position 1.

What I tried

I previously converted the data to a binary matrix, which looks like this:

structure(list(pos1_A = c(1, 1, 1), pos2_A = c(0, 0, 1), pos2_B = c(1, 
1, 0), pos3_A = c(1, 1, 1)), class = "data.frame", row.names = c("text1", 
"text2", "text3"))

      pos1_A pos2_A pos2_B pos3_A
text1      1      0      1      1
text2      1      0      1      1
text3      1      1      0      1

On that binary matrix I can run commands like `cor` to get correlations; however, instead of correlations I want the frequencies (co-occurrence counts).
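
To illustrate the relationship, here is a minimal base-R sketch: one possible way (not necessarily how we actually built it) to rebuild such a binary matrix from `df` and to get the pairwise counts from it with a plain cross-product.

# Sketch only: rebuild the 0/1 matrix from df and take its cross-product
chars <- strsplit(as.character(df$text), '')
labs  <- lapply(chars, function(x) paste0('pos', seq_along(x), '_', x))

# One row per text, one column per position/letter label
bin <- table(rep(df$id, lengths(labs)), unlist(labs))

# crossprod(B) = t(B) %*% B: for every pair of columns, the number of rows
# in which both are 1, i.e. exactly the frequencies described above
crossprod(bin)
#>        pos1_A pos2_A pos2_B pos3_A
#> pos1_A      3      1      2      3
#> pos2_A      1      1      0      1
#> pos2_B      2      0      2      2
#> pos3_A      3      1      2      3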


Note that this is different from questions about co-occurrences in which the variable name itself (here the position) is ignored, such as "How to use R to create a word co-occurrence matrix".

CodeNoob
  • I don't understand how you get your expected output? What is `1A`, `3A` ? – Ronak Shah May 24 '20 at 10:42
  • @RonakShah The first one is the position (the number) and the second one is the letter, I'm open to suggestions on how to represent this though! – CodeNoob May 24 '20 at 10:43
  • How long can the strings be? – dvd280 May 24 '20 at 10:49
  • @dvd280 they are 53 characters now (all equal size) – CodeNoob May 24 '20 at 10:52
  • Yeah this is not going to be possible I think... the number of possible combinations is at most 2^53 or 9,007,199,254,740,992. I mean, you just had 3 combinations produce `2**(n+1)` cells. You can see where this is going. Even if you had the compute to pull it off, writing something like this will be difficult – dvd280 May 24 '20 at 11:04
  • @dvd280 In our case, there are max 4 values (let's say `A, B, C, D`) per position, so when we construct a binary matrix like: `1_A, 1_B, 1_C, 1_D...` we would have 53*4 = 212, thus a 212 x 212 matrix. I made this binary matrix before to create a correlation matrix (I can add the code if that helps?), however instead of correlation values we now want the frequency – CodeNoob May 24 '20 at 11:17
  • You have 53 positions per string, and in each position you can have either A or B or C or D. It's true that your matrix would be 212*212, but in order to calculate it you would have to parse and evaluate 4**53 terms; your problem is more time complexity than space complexity. The act of identifying combinations of positions is basically the DNA sequencing problem, you can read about it and see how that relates. – dvd280 May 24 '20 at 11:26
  • @dvd280 I get that, but a correlation calculation on the 212x212 matrix finishes nearly instantly, and that also requires a comparable pairwise computation, right? – CodeNoob May 24 '20 at 11:29
  • I don't know how you calculate correlation; in a 212*212 matrix you have 212*212 observations. In the above problem you have more than trillions of potential observations. It's simple combinatorics. – dvd280 May 24 '20 at 11:30
  • Also, to demonstrate, I suggest you try a simple version - take 10 strings of length 5 and try to calculate it by hand; if you think it's hard, try 3 strings of 10 characters. – dvd280 May 24 '20 at 11:33
  • @dvd280 see edit (obviously this does not return a useful result for such a small matrix, but it was just as quick for the 212x212 matrix) – CodeNoob May 24 '20 at 11:33
  • @RonakShah figured it out with your other answer! – CodeNoob May 24 '20 at 11:51

2 Answers


Huge credit to @Ronak Shah for his answer here.


It's much simpler if we convert the categorical data to a numerical (binary) matrix, for example in this hacky but easy way using the homals package, and then apply the method by @Ronak Shah linked above:

# The dataset
df <- data.frame(
  id   = c('text1', 'text2', 'text3'),
  text = c('ABA', 'ABA', 'AAA')
)

# Split the strings into characters and add column names
library(dplyr)   # for the pipe %>%
library(tibble)  # for column_to_rownames()

df2 <- df %>%
  splitstackshape::cSplit('text', sep = '', stripWhite = FALSE,
                          type.convert = FALSE, direction = 'wide') %>%
  column_to_rownames('id')
colnames(df2) <- paste0('pos', 1:ncol(df2))

# Convert to binary matrix (hacky way)
bin.mat <- homals:::expandFrame(df2, clean = F)

# Method by @Ronak Shah to get the frequency matrix:
# for each pair of columns, count the rows where both indicators are 1
fun <- function(x, y) sum(bin.mat[, x] & bin.mat[, y])
n <- seq_along(bin.mat)
mat <- outer(n, n, Vectorize(fun))
dimnames(mat) <- list(names(bin.mat)[n], names(bin.mat)[n])

This produces the matrix:

>mat
       pos1_A pos2_A pos2_B pos3_A
pos1_A      3      1      2      3
pos2_A      1      1      0      1
pos2_B      2      0      2      2
pos3_A      3      1      2      3
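
For a rough sense of scale (the sizes below are taken from the comment thread that follows, so treat them as assumptions), dummy data of roughly the real dimensions can be generated and pushed through the same steps:

# Sketch only: 4320 strings of length 53 over the letters A-D (sizes assumed
# from the comments below), giving a 4320 x 212 binary matrix and a
# 212 x 212 result
set.seed(42)
big <- data.frame(
  id   = paste0('text', seq_len(4320)),
  text = replicate(4320, paste(sample(c('A', 'B', 'C', 'D'), 53, replace = TRUE),
                               collapse = ''))
)
# Run the steps above with `big` in place of `df`; the outer()/Vectorize()
# pass over the 212 columns is the step that dominates the runtime.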
CodeNoob
  • Be sure to let us know how fast it runs on your data with the 53-character strings, my friend. – dvd280 May 24 '20 at 11:53
  • @dvd280, the binary matrix: 4320 rows with 212 columns took only 2.6 seconds! :) – CodeNoob May 24 '20 at 12:11
  • @dvd280 compare the matrix in my question to the one in my answer and you'll see they are the same already. It couldn't take long anyway, as correlation analysis is similar and that also was really quick. I already tested it and it only took 2.6 seconds, so I don't get what you are trying to justify if this is exactly what I wanted – CodeNoob May 24 '20 at 12:15
  • @dvd280 Really don't understand what you are trying to achieve. I don't mind converting it to a binary matrix if that gives me the desired end results, and it only expands my original times 4 - end of discussion – CodeNoob May 24 '20 at 12:20
  • The question originally asked for a non-binary matrix, but one that only requires the addition of _n_ conforming _m_ x _m_ matrices, where _n_ is the number of rows in the original data frame and _m_ is the product of the maximum string length and the number of unique characters. For a string length of 53 where all uppercase and lower case letters are used, this requires about 10 million elementary calculations per row. It's a lot, but not unfeasible @dvd280 as my solution shows – Allan Cameron May 24 '20 at 13:03
  • @AllanCameron he clearly stated all strings are of length 53... – dvd280 May 24 '20 at 13:22
  • Yes, @dvd280, he did. That's exactly what my comment says. – Allan Cameron May 24 '20 at 13:35
  • @AllanCameron you calculated for 2 letters, how does this scale to 3? 4? From the way the OP stated his problem, he is trying to do something like DNA sequencing, which is NP-hard. You say this is feasible? Increasing the number of letters from 2 to 3 will increase your operations per row from 10 million to over a trillion... per row... feasible... – dvd280 May 24 '20 at 13:42
  • @dvd280 I show the calculation of how it expands above. There are 53 possible characters at position 1 (all uppercase letters, all lowercase letters or a space), then 53 possible letters at position 2, and so on. There are therefore 53 * 53 possible labels. The binary matrix is therefore (53 * 53) * (53 * 53), and there will be one matrix for each string in your data frame. You seem to be under the impression that every different _ordering_ must be computed, which of course would lead to combinatorial explosion as you describe, but that is **not** what the OP asked for or implied. – Allan Cameron May 24 '20 at 14:07
  • @dvd280 the 10 million calculations per row I referred to were for the 53 character strings, and I did say that explicitly in my comment. My answer to the dummy test data only needs 48 calculations to sum the binary matrices. Please, if you are going to make strong assertions in the comments, take care to read and understand what is being said first. I know your intentions are good, but you have simply misunderstood both the question and my comments here. – Allan Cameron May 24 '20 at 14:21
  • Allan, have you tried to create a dataframe with 100 rows of character strings of length 53 and run this code on them? – dvd280 May 24 '20 at 14:33

Here's an alternative approach that produces a matrix as originally requested:

# Make all strings the same length:
df$text <-  stringr::str_pad(df$text, side = "right", max(nchar(df$text)))

# Create a matrix with all letters labelled by their position:
all_vals <- apply(do.call(rbind, strsplit(df$text, "")), 1, 
                  function(x) paste0(seq_along(x), x))

# Create a vector of all possible letter / position combos
all_labs <- do.call(paste0, expand.grid(seq(max(nchar(df$text))),
                                        unique(unlist(strsplit(df$text, "")))))

# Create a function that will count all co-occurences per data frame row
f <- function(y, x) as.vector(outer(x, x, function(a, b) 1 * (a %in% y & b %in% y)))

# Create the results matrix and label it
m <- matrix(rowSums(apply(as.data.frame(all_vals), 2, f, all_labs)), nrow = length(all_labs))
rownames(m) <- all_labs
colnames(m) <- all_labs
m
#>    1A 2A 3A 1B 2B 3B
#> 1A  3  1  3  0  2  0
#> 2A  1  1  1  0  0  0
#> 3A  3  1  3  0  2  0
#> 1B  0  0  0  0  0  0
#> 2B  2  0  2  0  2  0
#> 3B  0  0  0  0  0  0

Created on 2020-05-24 by the reprex package (v0.3.0)
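
As an optional extra (a suggestion, not part of the answer above): letter/position combinations that never occur in the data, such as `1B` and `3B` here, show up as all-zero rows and columns, and can be dropped afterwards if they are not wanted:

# Keep only the labels that actually occur (non-zero rows/columns)
m2 <- m[rowSums(m) > 0, colSums(m) > 0]
m2
#>    1A 2A 3A 2B
#> 1A  3  1  3  2
#> 2A  1  1  1  0
#> 3A  3  1  3  2
#> 2B  2  0  2  2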

Allan Cameron