Given the following dataframe:

df <- structure(list(OTU = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
4L, 5L), class = "factor", .Label = c("OTU_1", "OTU_2", "OTU_3", 
"OTU_4", "OTU_5")), read = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 
5L, 5L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor")), class = "data.frame", row.names = c(NA, 
-25L))

     OTU read
1  OTU_1    a
2  OTU_2    a
3  OTU_3    a
4  OTU_4    a
5  OTU_5    a
6  OTU_1    b
7  OTU_2    b
8  OTU_3    b
9  OTU_4    b
10 OTU_5    b
11 OTU_1    c
12 OTU_2    c
13 OTU_3    c
14 OTU_4    c
15 OTU_5    c
16 OTU_1    d
17 OTU_2    d
18 OTU_3    d
19 OTU_4    d
20 OTU_5    d
21 OTU_1    e
22 OTU_2    e
23 OTU_3    e
24 OTU_4    e
25 OTU_5    e

I would like to create a new dataframe as follows:

       a    b   c   d   e
OTU_1  1    1   1   1   1
OTU_2  1    1   1   1   1

....

It's not a perfect example, because every value in this result is 1; in my real dataframe the number of occurrences of each letter varies between OTUs.

How can I do this quickly, given that my dataframe is quite big (1.5M rows)?

Thanks

david
  • `library(data.table); setDT(df)[, dcast(.SD, OTU ~ read, fun.aggregate = length)]` – markus Oct 16 '18 at 15:32
  • @markus The aggregation function of `dcast` always defaults to `length()` when none is specified and there is more than one observation per cell. [See also here](https://stackoverflow.com/q/33051386/2204410). – Jaap Oct 16 '18 at 15:39
  • `setDT(mybigdataframe)[, dcast(.SD, OTU ~ read, value.var = "OTU", fun.aggregate = length)]` fails with `Error in CJ(1:27677, 1:1231357) : Cross product of elements provided to CJ() would result in 34080267689 rows which exceeds .Machine$integer.max == 2147483647` – david Oct 16 '18 at 15:44
  • I have 27677 unique OTUs – david Oct 16 '18 at 15:45
  • @Jaap Good point. For the example data given in the question - with only one observation per cell - I just wanted to illustrate that `dcast` is working the way OP expected. – markus Oct 16 '18 at 15:49
  • @david Try it without `value.var = "OTU"`; `fun.aggregate = length` should not be necessary either, see Jaap's comment. – markus Oct 16 '18 at 15:49
  • `setDT(swarms)[, dcast(.SD, OTU ~ read, fun.aggregate = length)]` prints `Using 'OTU' as value column. Use 'value.var' to override` and then fails with `Error in CJ(1:27677, 1:1231357) : Cross product of elements provided to CJ() would result in 34080267689 rows which exceeds .Machine$integer.max == 2147483647` – david Oct 16 '18 at 15:50
  • `table(swarms)` ? – markus Oct 16 '18 at 15:51
  • `table(swarms)` fails with `Error in table(swarms) : attempt to make a table with >= 2^31 elements` – david Oct 16 '18 at 15:52
  • Note that `swarms` is already a data.table in my case: `dim(swarms)` returns `[1] 1231357 2` – david Oct 16 '18 at 15:53
  • How many unique values does `read` have? I.e., what is the output of `uniqueN(swarms, by = "read")`? – Jaap Oct 16 '18 at 16:03
  • That's it, I have a problem because they are all unique! Thanks – david Oct 16 '18 at 16:17
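
Following up on the comments above: the errors come from the size of the dense result (27677 OTUs times 1231357 unique reads), not from `dcast` or `table` themselves. Below is a minimal sketch of one possible workaround, not something proposed in the thread: check the number of cells first, then build a sparse count matrix with `Matrix::sparseMatrix` instead of a dense table. The toy `df` and the names `otu`, `read`, `n_cells` and `counts` are assumptions made for illustration, and whether a sparse matrix suits the downstream analysis is not something the thread settles.

```r
library(Matrix)  # recommended package, shipped with standard R installations

# Toy data with the same layout as the example in the question (5 OTUs x 5 reads)
df <- data.frame(
  OTU  = rep(paste0("OTU_", 1:5), times = 5),
  read = rep(letters[1:5], each = 5)
)

# Make sure both columns are factors so they can serve as row/column indices
otu  <- factor(df$OTU)
read <- factor(df$read)

# Number of cells a dense OTU x read table would need
n_cells <- as.numeric(nlevels(otu)) * nlevels(read)
n_cells < .Machine$integer.max            # TRUE for the toy data
27677 * 1231357 < .Machine$integer.max    # FALSE: the size reported in the errors above

# Sparse count matrix: x values of duplicated (i, j) pairs are summed,
# so a vector of ones yields the number of occurrences per OTU/read pair
counts <- sparseMatrix(
  i = as.integer(otu),
  j = as.integer(read),
  x = rep(1, nrow(df)),
  dims = c(nlevels(otu), nlevels(read)),
  dimnames = list(levels(otu), levels(read))
)
```

Only the non-zero cells are stored, so memory grows with the number of rows in the input rather than with the product of the two numbers of unique values. If every read really is unique, each column holds a single non-zero entry, so it may be worth asking whether a wide table is the right target at all.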
