I have a data set similar to this:

x <- sample(c("A", "B", "C", "D", "E"), 50, replace=TRUE, prob=c(0.1, 0.2, 0.4, 0.25, 0.05))
y <- sample(c("xyz", "mno", "abc", "def", "hkl", "opq", "rst", "ghi", "fgh", "vwx"), 50, replace=TRUE, prob=c(0.2, 0.1, 0.07, 0.03, 0.1, 0.05, 0.05, 0.1, 0.15, 0.15))
d <- data.frame(x,y)

str(d)
'data.frame':   50 obs. of  2 variables:
$ x: Factor w/ 5 levels "A","B","C","D",..: 4 1 4 2 3 4 1 5 4 2 ...
$ y: Factor w/ 10 levels "abc","def","fgh",..: 5 1 5 4 4 9 5 9 10 6 ...

table(d)
   y
x   abc def fgh ghi hkl mno opq rst vwx xyz
A   1   0   1   1   2   0   0   0   0   3
B   0   1   1   1   0   1   0   0   3   3
C   1   0   1   5   0   3   0   1   2   2
D   1   1   0   1   3   0   1   0   4   2
E   0   0   0   0   2   0   1   0   1   0

Now I would like to find out, for each pair of values in x, what share of the y values they have in common. So something like this:

   A   B   C   D   E
A   1  .3  .4  .4  .1
B  .3   1  .5  .4  .1
C  .4  .5   1  .4  .1
D  .4  .4  .4   1  .3
E  .1  .1  .1  .3   1

Or alternatively

   A   B   C   D   E
A  10  3   4   4   1
B  3   10  5   4   1
C  4   5   10  4   1
D  4   4   4   10  3
E  1   1   1   3  10

Do you know of any ways to do that?

Arthur Pennt
  • I don't understand "the percentage of how many of the y are in a one of the values in x". Can you please provide an example? – Roland Aug 24 '16 at 12:07
  • Yeah, I know. A complicated sentence; I might edit it. What I want is shown below that sentence: the share of y values that, for instance, A has in common with B. With `table(d)` we see that `A` and `B` both have `fgh`, `ghi`, and `xyz`. That is three occurrences out of ten, so 30%. That's the value .3 in the first of the two tables at the bottom. Hope that makes it clearer. – Arthur Pennt Aug 24 '16 at 12:16
  • See `?crossprod`, e.g. as used [here](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses): `tcrossprod(table(d) > 0)` – alexis_laz Aug 24 '16 at 12:24
  • Ah, great. That worked. I did not know that command! Thank you! – Arthur Pennt Aug 24 '16 at 12:34
  • Hm. I tried it on my real dataset and it gave absurd values of 100,000,000,000 and more. That cannot be right, since I have just 42,590 that I want to run `tcrossprod` on. – Arthur Pennt Aug 24 '16 at 12:46
  • @8bytez : Did you pass `table(.) > 0` to `tcrossprod` or `table(.)`? – alexis_laz Aug 24 '16 at 13:28
  • I forgot it at first, but I have tried it now. It gives me the following error message: `Error: cannot allocate vector of size 13.5 Gb` ^^. I tried a different approach, `xy <- Reduce(intersect, list(x,y))`, and that worked for me. – Arthur Pennt Aug 24 '16 at 17:29
  • @8bytez : I'm not sure how you applied `intersect` this way to get pairwise comparisons, but, perhaps, you could try the "Matrix" package to reduce the memory needed. The sparse equivalent of the above is `Matrix::tcrossprod(xtabs(~ x + y, sparse = TRUE) > 0L)` – alexis_laz Aug 25 '16 at 09:21
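
For readers following the comments, here is a minimal sketch of the `tcrossprod` suggestion. It assumes the counts shown in the `table(d)` output of the question, retyped by hand so the result is reproducible. The names `counts`, `present`, `shared`, `share` and `raw` are made up for illustration. It also shows why leaving out the `> 0` step can produce very large numbers.

```r
# Counts copied by hand from the table(d) output in the question
# (rows are x, columns are y).
counts <- matrix(
  c(1, 0, 1, 1, 2, 0, 0, 0, 0, 3,
    0, 1, 1, 1, 0, 1, 0, 0, 3, 3,
    1, 0, 1, 5, 0, 3, 0, 1, 2, 2,
    1, 1, 0, 1, 3, 0, 1, 0, 4, 2,
    0, 0, 0, 0, 2, 0, 1, 0, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D", "E"),
                  c("abc", "def", "fgh", "ghi", "hkl",
                    "mno", "opq", "rst", "vwx", "xyz"))
)

# 0/1 indicator: does this x value occur with this y value at all?
present <- (counts > 0) + 0

# Entry [i, j] is the number of y values that x values i and j share.
shared <- tcrossprod(present)

# Share relative to the number of distinct y values (ten here).
share <- shared / ncol(counts)

shared["A", "B"]  # 3 (fgh, ghi, xyz)
share["A", "B"]   # 0.3

# Without the "> 0" step the raw counts are multiplied and summed,
# so frequent combinations inflate the result.
raw <- tcrossprod(counts)
raw["A", "B"]     # 11 = 1*1 + 1*1 + 3*3
```

Note that the diagonal of `shared` holds the number of distinct y values seen with each x value (5 for `A` in this data), not 10 as in the illustrative table in the question. Whether dividing by the number of y levels is the right denominator depends on what "percentage" should mean for the real data.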
