
I'm working with 12 large data files, all of which hover between 3 and 5 GB, so I've been turning to RSQLite for the import and initial selection. Giving a reproducible example in this case is difficult, but if you can come up with anything that helps, that would be great.

If I take a small set of the data, read it in, and write it to a table, I get exactly what I want:

library(RSQLite)

con <- dbConnect("SQLite", dbname = "R2")
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = 100, header = TRUE)
dbWriteTable(con, name = "Chr1test", value = data)

> dbListFields(con, "Chr1test")
[1] "row_names" "CHR_A"     "BP_A"      "SNP_A"     "CHR_B"     "BP_B"      "SNP_B"     "R2"     

> dbGetQuery(con, "SELECT * FROM Chr1test LIMIT 2")
  row_names CHR_A BP_A      SNP_A CHR_B BP_B       SNP_B         R2
1         1     1 1579 SNP-1.578.     1 2097 SNP-1.1096. 0.07223050
2         2     1 1579 SNP-1.578.     1 2553 SNP-1.1552. 0.00763724

If I read all of my data directly into a table, though, my columns aren't separated correctly. I've tried both sep = " " and sep = "\t", but both give the same column separation:

dbWriteTable(con, name = "Chr1", value = "chr1.ld", header = TRUE)
> dbListFields(con, "Chr1")
[1] "CHR_A_________BP_A______________SNP_A__CHR_B_________BP_B______________SNP_B___________R

I can tell that it's clearly some sort of delimiter issue, but I've exhausted my ideas on how to fix it. Has anyone run into this before?
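
One thing I have not tried yet: looking at the run of underscores in that field name, my guess is that the columns in chr1.ld are padded with runs of spaces rather than separated by a single character, which read.table handles fine but the direct file import apparently does not. If that's the case, an untested workaround would be to squeeze the whitespace down to single tabs first, processing the file in chunks so it never has to fit in memory ("chr1_tabs.ld" below is just a placeholder name for the cleaned copy):

fin  <- file("chr1.ld", "r")
fout <- file("chr1_tabs.ld", "w")   # placeholder name for the cleaned-up copy
while (length(chunk <- readLines(fin, n = 1000000)) > 0) {
  chunk <- gsub("^[[:space:]]+|[[:space:]]+$", "", chunk)  # drop leading/trailing padding
  chunk <- gsub("[[:space:]]+", "\t", chunk)               # squeeze runs of spaces to one tab
  writeLines(chunk, fout)
}
close(fin)
close(fout)

dbWriteTable(con, name = "Chr1", value = "chr1_tabs.ld", header = TRUE, sep = "\t")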

Edit, update: It seems as though this works:

n <- 1000000
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = n, header = TRUE)
col.names <- names(data)

con_data <- dbConnect("SQLite", dbname = "R2")
while (nrow(data) == n) {
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
  # the connection stays open, so later chunks have no header line;
  # read them without one and reuse the column names from the first chunk
  data <- read.table(f, nrow = n, header = FALSE, col.names = col.names)
}
close(f)
if (nrow(data) != 0) {
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
}

I still can't quite figure out why writing the table directly through RSQLite is a problem, though. Possibly a memory issue.

MHtaylor
  • If `read.table(f, nrow=100, header=TRUE)` works for 100 rows, do you know exactly at which value it stops working? Have you compared the `str()` of `data` when you read just 100 rows vs. all your rows? There's probably something wrong with the import. However, since you have not provided a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), there's no way to know for sure what's going on. We can only guess. – MrFlick Sep 22 '14 at 16:29
  • read.table(f, nrow=100, header = TRUE) works, and I get exactly the 100 rows I expected. I can't really provide a reproducible example because the file is extremely large, and its size seems to be what's causing the error. I can provide files otherwise, but I can't recreate this error with smaller sets. – MHtaylor Sep 22 '14 at 19:00

1 Answer


I am guessing that your big file is causing a free-memory issue (see Memory Usage under the docs for read.table). It would have been helpful to show us the first few lines of chr1.ld (on *nix systems you can just say "head -n 5 chr1.ld" to get the first five lines). If it is a memory issue, then you might try sipping the file as a workaround rather than gulping it whole. Determine or estimate the number of lines in chr1.ld (on *nix systems, say "wc -l chr1.ld"). Let's say your file has 100,000 lines.

sip.size <- 100
col.names <- names(read.table("chr1.ld", nrow = 1, header = TRUE))
for (i in seq(0, 100000, by = sip.size)) {
    # re-read from the file path each pass, skipping the header line
    # plus the i rows already written
    data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1,
                       header = FALSE, col.names = col.names)
    dbWriteTable(con, name = "SippyCup", value = data, append = TRUE)
}

You'll probably see warnings at the end but the data should make it through. If you have character data that read.table is trying to factor, this kludge will be unsatisfactory unless there are only a few factors, all of which are guaranteed to occur in every chunk. You may need to tell read.table not to factor those columns or use some other method to look at all possible factors so you can list them for read.table. (On *nix, split out one column and pipe it to uniq.)
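
For instance, a minimal sketch of the read.table call inside the loop with factoring switched off; the column names and types here are only guessed from the sample rows shown in the question, so adjust them to match the real file:

# Guessed column names and types based on the sample output above.
# colClasses keeps read.table from turning the SNP columns into factors.
ld.names   <- c("CHR_A", "BP_A", "SNP_A", "CHR_B", "BP_B", "SNP_B", "R2")
ld.classes <- c("integer", "integer", "character",
                "integer", "integer", "character", "numeric")
data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1, header = FALSE,
                   col.names = ld.names, colClasses = ld.classes,
                   stringsAsFactors = FALSE)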

Patrick
  • I think I came up with something quite similar, though I imagine this would work. My data is around 53 million lines, so that could be the issue – MHtaylor Sep 25 '14 at 11:16