I have the following data frame:

library(tidyverse)

df <- data.frame(
  READS   = rep(c('READa', 'READb', 'READc'), each = 3),
  GENE    = rep(c('GENEa', 'GENEb', 'GENEc'), each = 3),
  COMMENT = rep(c('CommentA', 'CommentA', 'CommentA'), each = 3)
)

> df
  READS  GENE  COMMENT
1 READa GENEa CommentA
2 READa GENEa CommentA
3 READa GENEa CommentA
4 READb GENEb CommentA
5 READb GENEb CommentA
6 READb GENEb CommentA
7 READc GENEc CommentA
8 READc GENEc CommentA
9 READc GENEc CommentA

I want to produce the output below. The following code works on a small data frame:

df %>%
  count(READS, GENE) %>%
  pivot_wider(
    names_from = GENE, values_from = n,
    values_fill = list(n = 0)
  )

 # A tibble: 3 x 4
   READS GENEa GENEb GENEc
   <chr> <int> <int> <int>
 1 READa     3     0     0
 2 READb     0     3     0
 3 READc     0     0     3

The real input data frame is very large: 27,748,156 rows (roughly 28 million). With a table that big, I get the following error:

Error: Can't index beyond the end of a vector.
The vector has length 1 and you've tried to subset element 712.

Any idea how I can deal with such a big table?
david
  • Could you keep a summarised output in the long format, i.e. `df %>% count(READS, GENE)`? – akrun Dec 17 '19 at 16:08
  • Curious, why do you need to reshape from ideally the long, ["tidy"](https://r4ds.had.co.nz/tidy-data.html) format to wide format? – Parfait Dec 17 '19 at 16:19
  • How about matrix assignment: https://rextester.com/KRFYMP23374? Inspired by @Aaron [here](https://stackoverflow.com/a/9617424/1422451). – Parfait Dec 17 '19 at 16:32
  • I want to be able to export the long format to a more human-readable format, although I may reconsider this and see if I can filter out some data beforehand. This is an experiment with 800 samples, which explains why the data is so big. – david Dec 17 '19 at 16:37
  • Are there any cases where you would get counts for a gene in multiple reads? – TobiO Dec 17 '19 at 16:55
  • Yes, a read can be present in multiple genes. How would you filter the long table so that only reads present in at least 20% of the total number of genes are kept? For example, if READa is present in at least 20% of the genes, it is kept. – david Dec 17 '19 at 17:00
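
The suggestions in the comments above (summarise in long format, filter by the 20% rule, then fill a matrix by index) could be combined roughly as follows. This is only a sketch on the toy data from the question, not the code behind the rextester link; `min_frac`, `counts`, `kept`, and `wide` are hypothetical names introduced for illustration, and "present" is assumed to mean "has at least one row for that gene".

```r
library(dplyr)

# Toy data mirroring the question; the real table has about 27.7 million rows.
df <- data.frame(
  READS = rep(c("READa", "READb", "READc"), each = 3),
  GENE  = rep(c("GENEa", "GENEb", "GENEc"), each = 3)
)

# 1. Summarise in long format: one row per observed (READS, GENE) pair.
counts <- count(df, READS, GENE)

# 2. Keep reads that occur in at least `min_frac` of all distinct genes.
#    `min_frac` is a hypothetical parameter; the thread asks for 20%.
min_frac <- 0.20
n_genes  <- n_distinct(counts$GENE)

kept <- counts %>%
  group_by(READS) %>%
  filter(n() / n_genes >= min_frac) %>%  # n() = genes observed for this read
  ungroup()

# 3. Build the wide table by matrix assignment instead of pivot_wider().
#    Assumption: reads x genes is small enough after filtering to fit in
#    memory as a dense integer matrix.
reads <- as.character(unique(kept$READS))
genes <- as.character(unique(kept$GENE))

wide <- matrix(
  0L,
  nrow = length(reads), ncol = length(genes),
  dimnames = list(reads, genes)
)

# A two-column index matrix addresses one (row, column) cell per count.
idx <- cbind(match(kept$READS, reads), match(kept$GENE, genes))
wide[idx] <- kept$n
```

If the filtered matrix is still too large to hold densely, staying in the long format (`kept`) for export may be the safer choice, since a dense matrix stores every zero cell explicitly.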
