dcast.data.table issue with large data and small decimal values

Question

I used functions from this answer to read multiple files and create a data table. I want one column per FileName, and wherever a smallRNA does not occur in a given file, the corresponding cell should be filled with 0.

part of dataset:

    dput(dt[1:4])
structure(list(FileName = c("Sample_4C_NaIO4", "Sample_4C_NaIO4", 
"Sample_4C_NaIO4", "Sample_4C_NaIO4"), smallRNA = c("TCGTACGACTCTTAGCGG", 
"GTACGACTCTTAGCGG", "CTCGTACGACTCTTAGCGG", "CGTACGACTCTTAGCGG"
), counts = c(4166178L, 564940L, 89932L, 52670L)), class = c("data.table", 
"data.frame"), row.names = c(NA, -4L), .internal.selfref = <pointer: 0x180a460>)

my code:

temp <- list.files(pattern = "\\.txt$")
dt <- rbindlist(sapply(temp, fread, simplify = FALSE),
                use.names = TRUE, idcol = "FileName")
dt$FileName <- sub("\\.txt$", "", dt$FileName)
finaldt <- dcast.data.table(dt, smallRNA+counts ~FileName,
drop=FALSE,fill=0)

result:

    finaldt <- dcast.data.table(dt,smallRNA+counts ~ FileName,drop = FALSE,fill = 0)
Using 'counts' as value column. Use 'value.var' to override
Error in CJ(smallRNA = c("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAA", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAG",  : 
  Cross product of elements provided to CJ() would result in 70585808594 rows which exceeds .Machine$integer.max == 2147483647
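
Judging by the error message, with drop = FALSE the call asks CJ() for every combination of the unique values of the formula variables, and with both smallRNA and counts on the left-hand side that product becomes enormous. A rough way to gauge the size beforehand is sketched below on a toy table; `toy` and `lhs_rows` are illustrative names, not part of the original code, and the exact internals of dcast may differ between data.table versions.

```r
library(data.table)

# Toy stand-in for the combined table; the real dt is far larger.
toy <- data.table(
  FileName = c("A", "A", "B", "B"),
  smallRNA = c("s1", "s2", "s1", "s3"),
  counts   = c(10L, 20L, 30L, 40L)
)

# Rows needed to pair every unique smallRNA with every unique counts value.
# as.numeric() avoids integer overflow when the product is very large.
lhs_rows <- as.numeric(uniqueN(toy$smallRNA)) * uniqueN(toy$counts)

# TRUE would mean the cross product cannot be indexed with R integers.
too_big <- lhs_rows > .Machine$integer.max
```

On the real data the same product is what pushes past .Machine$integer.max, which is why bit64 does not help: the limit is on the number of rows, not on the size of the stored values.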

I thought of using the bit64 package, but I'm not sure how.

version:

version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          5.1                         
year           2018                        
month          07                          
day            02                          
svn rev        74947                       
language       R                           
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray

Edit 1

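The last part of the code must be changed so that counts is the value column (passed as a quoted name) instead of part of the formula:

finaldt <- dcast.data.table(dt, smallRNA ~ FileName,
                            drop = FALSE, fill = 0, value.var = "counts")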
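
A minimal sketch of this reshaping on a toy table, with value.var given as a quoted column name. The table `toy` and its values are made up for illustration only.

```r
library(data.table)

# Toy long-format table: one row per (file, smallRNA) pair.
toy <- data.table(
  FileName = c("A", "A", "B", "B"),
  smallRNA = c("s1", "s2", "s1", "s3"),
  counts   = c(10L, 20L, 30L, 40L)
)

# One row per smallRNA, one column per FileName; missing pairs become 0.
wide <- dcast(toy, smallRNA ~ FileName,
              value.var = "counts", fill = 0, drop = FALSE)

# s2 does not occur in file B, so its B cell is filled with 0;
# s3 does not occur in file A, so its A cell is filled with 0.
print(wide)
```

With only smallRNA on the left-hand side, the result has one row per distinct sequence rather than one per sequence and count combination.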

Edit 2: issue with numbers lower than 1

In the combined dataset "dt" there aren't any values lower than 1, although an individual input table (myfiles[[1]]) does contain them:

> filter(dt,counts<1)
[1] FileName smallRNA counts  
<0 rows> (or 0-length row.names)
> myfiles[[1]] %>% filter(counts<1) %>% tail()
# A tibble: 6 x 2
  smallRNA                                                                                counts
  <chr>                                                                                    <dbl>
1 ENST00000592744.1 ncrna chromosome:GRCh38:9:81946438:81976806:-1 gene:ENSG00000267559… 0.00106
2 ENST00000594089.1 ncrna chromosome:GRCh38:11:64778954:64779405:1 gene:ENSG00000269038… 0.00106
3 ENST00000607991.1 ncrna chromosome:GRCh38:22:38743495:38743910:1 gene:ENSG00000273076… 0.00106
4 ENST00000608972.1 ncrna chromosome:GRCh38:7:29008926:29010252:1 gene:ENSG00000272568.… 0.00106
5 ENST00000618845.1 ncrna chromosome:GRCh38:14:49863072:49864379:1 gene:ENSG00000278002… 0.00106
6 ENST00000625800.1 ncrna chromosome:GRCh38:CHR_HG2232_PATCH:233205199:233205479:1 gene… 0.00106

Is there a way to include these values as well?
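
One way to narrow this down is to check the type of counts in each per-file table and then confirm whether fractional values survive binding and reshaping. The sketch below assumes the per-file tables are held in a named list; `files` and its contents are hypothetical stand-ins, not the real data, and the behaviour shown is that of a current data.table release.

```r
library(data.table)

# Hypothetical per-file tables: "A" has integer counts, "B" fractional ones.
files <- list(
  A = data.table(smallRNA = c("s1", "s2"), counts = c(10L, 20L)),
  B = data.table(smallRNA = c("s1", "s3"), counts = c(0.00106, 2.5))
)

# Type of the counts column in each table before binding.
types_before <- sapply(files, function(x) class(x$counts))

# Bind, then look for values below 1 in the combined table.
combined <- rbindlist(files, use.names = TRUE, idcol = "FileName")
small <- combined[counts < 1]

# Reshape and check that the fractional value is still present.
wide <- dcast(combined, smallRNA ~ FileName, value.var = "counts", fill = 0)
print(wide)
```

If the fractional rows are present in the per-file tables but absent from the combined one, the difference is more likely in how the files were read or which files were bound than in dcast itself.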

Reading your question and answer, it's very difficult to tell what's going on. It seems like you solved your original question - it was just a typo, but then you edited to add a new question on the end that has something to do with small values? Rather than editing new questions into an old one (especially one that is answered), I'd strongly suggest asking a new focused question on the new issue. And providing sample data that illustrates the issue. — Gregor Thomas, Dec 05 '18 at 16:39
`bit64`, to my knowledge, provides a class for 64-bit integers, allowing R to work with larger integer values than it would otherwise. Your issue 2 seems to be with non-integer values between 0 and 1, so `bit64` isn't really relevant. — Gregor Thomas, Dec 05 '18 at 16:40
@Gregor Yep, bit64 was my suggestion for the first question. But my question about having a lot of rows still stands. Should I make another post for the second? Edit: My answers are mere workarounds, that's why I'm leaving it open. — K Y, Dec 05 '18 at 17:28
Questions are encouraged to be focused on one issue. That makes the questions more approachable as they tend to be shorter and clearer, and it doesn't discourage anyone from answering if they only have a solution for half. So yes, I think you should edit to remove the second issue from this question, and ask a new question focused solely on that. (Also with relevant data, the sample data you've shared doesn't seem to have any `count`s less than 1, so I'm guessing it doesn't illustrate your second issue.) — Gregor Thomas, Dec 05 '18 at 17:41
You may also want to edit your answer if you're still hoping for more. When I read "*So there was that problem within the code, edited and it's fixed,*" it doesn't sound like "a mere workaround" that you're hoping for more help on. It sounds like you "fixed" your problem. *If* someone reads on and gets to the *"regarding the issue when there are a lot of rows"*, maybe they would look back up at your question, but I don't see anything there mentioning different behavior or problems with lots of rows... — Gregor Thomas, Dec 05 '18 at 17:45
Also, with questions anyone can view the edit history by clicking the "edited X time ago" link at the bottom, so you don't need to label edits in the text. Since you don't have any non-self answers yet, I'd recommend editing your original code with the improvements you've made so far, and make sure any remaining problems are clear. I've read your question 3 times now and I have no idea what remains to be solved about issue 1. And if your answer doesn't solve your problem, delete it. — Gregor Thomas, Dec 05 '18 at 17:48
