4

I have a 20 GB file which looks like the following:

Read name, Start position, Direction, Sequence

Note that read names are not necessarily unique.

E.g. a snippet of my file would look like

Read1, 40009348, +, AGTTTTCGTA
Read2, 40009349, -, AGCCCTTCGG
Read1, 50994530, -, AGTTTTCGTA

I want to be able to store these lines in a way that allows me to

  1. keep the file sorted based on the second value
  2. iterate over the sorted file

It seems that databases can be used for this.

The documentation seems to imply that dbm cannot be used to sort the file and iterate over it.

Therefore I'm wondering whether SQLite3 will be able to do 1) and 2). I know that I will be able to sort my file with an SQL query and iterate over the result set with sqlite3. However, will I be able to do this without running out of memory on a computer with 4 GB of RAM?
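To make the intent concrete, here is a rough sketch of what I have in mind using Python's sqlite3 module (the file, table and column names are placeholders I made up):

```python
import csv
import sqlite3

conn = sqlite3.connect("reads.db")
conn.execute("CREATE TABLE IF NOT EXISTS reads "
             "(name TEXT, start INTEGER, direction TEXT, sequence TEXT)")

# Load the file line by line so it never has to fit in RAM.
with open("reads.txt") as f:
    rows = csv.reader(f, skipinitialspace=True)
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", rows)
conn.commit()

# 1. "Sort" on the second value: index the start position.
conn.execute("CREATE INDEX IF NOT EXISTS idx_start ON reads (start)")
conn.commit()

# 2. Iterate over the reads in start-position order.
for name, start, direction, sequence in conn.execute(
        "SELECT name, start, direction, sequence FROM reads ORDER BY start"):
    pass  # process one read at a time here

conn.close()
```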

The Unfun Cat
  • You could read the file in line by line, load it into an sqlite3 database, then run an SQL statement with `ORDER BY`. – Jakob Bowyer Nov 08 '12 at 17:07
  • And don't forget to create an index on the `Start position`. – Nov 08 '12 at 17:08
  • You could partition out the data to multiple files based on the first few digits of the start position and sort each file. But the proper way is probably a real database. – Steven Rumbalski Nov 08 '12 at 17:13
  • Inserting a record "into the middle" of a disk file in an efficient fashion – i.e. while avoiding having to rewrite most of the file – is a nontrivial problem, and pretty much one of the reasons why we use databases. They should be able to handle 1) and 2) using orders of magnitude less in terms of memory than the dataset size. That said, iterating over *everything* will still be slow, there's no getting around the fact that processing 20GB worth of data in total takes a long time. – millimoose Nov 08 '12 at 17:39

2 Answers

8

SQLite is able to do both 1) and 2).

I recommend you try it and report any problems you encounter.
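As a rough sketch of what 2) looks like in practice (assuming a table named `reads` with the start position in a `start` column, which is not from the question): create an index on the start position so `ORDER BY` can walk the index instead of sorting 20 GB in memory, and iterate over the cursor itself rather than calling `fetchall()`, so rows are streamed from disk instead of being held in RAM.

```python
import sqlite3

conn = sqlite3.connect("reads.db")

# The index lets ORDER BY read rows in start order instead of
# sorting the whole table first.
conn.execute("CREATE INDEX IF NOT EXISTS idx_start ON reads (start)")

cur = conn.execute("SELECT name, start, direction, sequence "
                   "FROM reads ORDER BY start")
for name, start, direction, sequence in cur:  # rows are fetched lazily
    pass  # process one read at a time here
```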

With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes.

saaj
  • Wow, so SQLite3 won't limit me. Thanks. – The Unfun Cat Nov 08 '12 at 17:08
  • @TheUnfunCat Note that in theory, you could have up to a 140 TB database (according to [item 12 on their limits page](http://www.sqlite.org/limits.html)). While technically possible, performance will likely be abysmal. See my answer for a hint at the performance. – ernie Nov 08 '12 at 17:22
2

See this question about large SQLite databases.

The important bit:

I tried to insert multiple rows into a sqlite file with just one table. When the file was about 7GB (sorry I can't be specific about row counts) insertions were taking far too long. I had estimated that my test to insert all my data would take 24 hours or so, but it did not complete even after 48 hours.

The sample used was ~50GB of data, though system specs are not mentioned.
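If you do try it, the one-off load can usually be kept tolerable by committing in large batches and creating the index only after the data is in. A sketch under those assumptions (the file name, table layout and pragma choices are mine, not from the linked question):

```python
import sqlite3

conn = sqlite3.connect("reads.db")
conn.execute("CREATE TABLE IF NOT EXISTS reads "
             "(name TEXT, start INTEGER, direction TEXT, sequence TEXT)")

# Trades crash-safety for speed; fine for a one-off load that can be rerun.
conn.execute("PRAGMA synchronous = OFF")
conn.execute("PRAGMA journal_mode = MEMORY")

batch = []
with open("reads.txt") as f:
    for line in f:
        batch.append([field.strip() for field in line.split(",")])
        if len(batch) == 100000:
            conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", batch)
            conn.commit()  # one commit per 100k rows, not per row
            batch = []
if batch:
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", batch)
    conn.commit()

# Build the index on the start position only after the bulk load;
# maintaining it on every single insert is what tends to make loads crawl.
conn.execute("CREATE INDEX IF NOT EXISTS idx_start ON reads (start)")
conn.commit()
conn.close()
```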

ernie
  • Ah, so just inserting my 20 GB file will take a long time. I'll only have to do it once, so I might try it. My advisor recommended this: http://docs.python.org/2.7/library/bsddb.html but it is deprecated for some reason. – The Unfun Cat Nov 08 '12 at 17:26
  • It might not take that long... a lot depends on whether you index or not. I'd suggest trying it and profiling the performance. Also, depending on the analysis you want to do of the biometric data, you may want to split the data in other forms, use a complete relational DB, etc. Or maybe just get your advisor to purchase a system with 32 GB of RAM... ;) – ernie Nov 08 '12 at 17:27
  • @TheUnfunCat The Berkeley DB is mostly a key-value store. I don't think it's intended to do sorting at all. – millimoose Nov 08 '12 at 17:40
  • Also, 2009 was a while ago. Odds are using a SSD (as you absolutely should for any sort of nontrivial database work) would improve things a lot. – millimoose Nov 08 '12 at 17:42
  • Makes sense, I described a slightly simpler problem to him. Thanks for the info. And I have SSD! Look forward to trying. All the input here has been valuable. Thanks. – The Unfun Cat Nov 08 '12 at 17:42