
I want to do some analysis on a pretty dang big file:

$ ls -lSh jq1pileup
-rw-rw-r--+ 1 balter SomeGroup 80G Nov 15 12:23 jq1pileup
$ wc jq1pileup
 3099750719 30997507190 85744405658 jq1pileup

But fortunately, I'm on a cluster with some pretty beefy machines:

$ free -mhtal
             total       used       free     shared    buffers     cached  available
Mem:           94G        71G        22G       1.4G       592M        50G         0B
Low:           94G        71G        22G
High:           0B         0B         0B
-/+ buffers/cache:        20G        73G
Swap:         195G       6.1G       188G
Total:        289G        77G       211G

I'm finding that reading in my file takes an extremely long time (as in, measured in hours), and doing something simple like getting the shape or, horrors, a histogram again takes hours. What is reasonable to expect?

Is this what I should expect for a task such as this?

EDIT:

The file is a TSV file (FWIW, a pileup of genomic abundance). Oh, and it is not apparent from the `wc` output, but it has 9 columns.
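
For context, this is roughly what the read looks like (a minimal sketch only; the column names below are placeholders, not the real ones):

import pandas as pd

# Tab-separated, 9 columns, no header row; placeholder column names.
df = pd.read_csv(
    "jq1pileup",
    sep="\t",
    header=None,
    names=[f"col{i}" for i in range(9)],
)
print(df.shape)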

  • What kind of file is it (CSV, something else)? How do you read it? – MaxU - stand with Ukraine Jan 20 '17 at 19:05
  • @MaxU it is a TSV file (FWIW, a pileup of genomic abundance). Oh, and it is not apparent from `wc`, but it has 9 columns. Adding to original question. – abalter Jan 20 '17 at 19:16
  • can you take a random sample for your analysis or do you absolutely need all of it? take maybe every 100th line into a new file... – Aaron Jan 20 '17 at 19:17
  • @abalter, can you provide a sample of that file (for example first 3-5 rows) and post your code (how do you read it)? – MaxU - stand with Ukraine Jan 20 '17 at 19:20
  • @Aaron I'm considering that. Some of the things I want to do are a histogram and non-parametric stats like Spearman's correlation. The histogram I could do by serial binning and then combining bins (there's a sketch of that chunked approach after these comments). Non-parametric correlation is not so easy: I could take random samplings of long, contiguous regions, but I'm not sure how to "average" the correlation values. There are such options, but brute force would be the easiest -- send it off to the cluster and check it later when done. – abalter Jan 20 '17 at 19:22
  • Have you had a look at http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas? There are some good ideas there. – Alex Riley Jan 20 '17 at 19:23
  • Those are great ideas! One that really hit me right in the head is that the numbers are all in the `int16` or at least `int32` range. Reading in with the correct data type should help a lot! – abalter Jan 20 '17 at 19:27
  • Also, I would make a strong argument for writing your analysis to work on a stream of data as the file is read in. Even if you have to read it twice, you may find this to be a good approach that uses minimal memory. Reading an 80G file off a conventional hard drive will likely take no less than 10-15 min assuming maximal throughput, so don't expect miracles. – Aaron Jan 20 '17 at 19:29
  • @abalter, `Swap: used: 6.1G` - this doesn't look good. You should try to avoid swapping at all. – MaxU - stand with Ukraine Jan 20 '17 at 19:32
  • Also, if the data is decently compressible I'd try to store it compressed with something fast like LZO, to minimize the influence of the disk bottleneck. – Matteo Italia Jan 20 '17 at 20:06
  • Have you considered getting more memory and a faster disk? 256 or even 512 GB of RAM isn't that prohibitive anymore, and an M.2 disk should easily get you 80-90 MB/sec in raw read/write speed. – thebjorn Jan 20 '17 at 21:17
  • @thebjorn the node has 289GB of total RAM and, at the moment I checked, 211GB free RAM. – abalter Jan 20 '17 at 22:46
  • @Aaron that would be a good approach. Pearson's correlation is linear, so I wrote a function that calculates it by reading the file one line at a time and accumulating moments (see the moment-accumulation sketch after these comments). For rank-based statistics it's not so easy. I wrote a function that reads in the file and accumulates ranks into a dictionary, then writes a new file the same size as the original with ranks substituted for the original data, and then performs the Pearson correlation on that. It works, but it's clunky. Also, I could roll my own versions of all sorts of things, but I'd rather leverage Pandas. – abalter Jan 20 '17 at 22:49
  • Retrieving the shape takes hours? I'm pretty sure that shape lookup time should not be affected by the size of the array. – Paul Jan 21 '17 at 01:51
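
Following up on the streaming and smaller-dtype suggestions in the comments, here is a minimal sketch of the chunked-histogram idea. The column index, dtype, bin edges, and chunk size are all assumptions for illustration only:

import numpy as np
import pandas as pd

# Fixed bin edges chosen up front; per-chunk counts are simply summed.
bins = np.arange(0, 1001, 10)                 # hypothetical bin edges
counts = np.zeros(len(bins) - 1, dtype=np.int64)

reader = pd.read_csv(
    "jq1pileup",
    sep="\t",
    header=None,
    usecols=[3],                  # hypothetical: the column to histogram
    dtype=np.int32,               # smaller dtype, as suggested in the comments
    chunksize=5_000_000,          # rows per chunk; tune to available memory
)

for chunk in reader:
    hist, _ = np.histogram(chunk.iloc[:, 0].to_numpy(), bins=bins)
    counts += hist

Because each chunk only contributes additive counts, memory use stays bounded by the chunk size rather than the 80G file.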
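
And a sketch of the moment-accumulation approach described in the last comments, for a streaming Pearson correlation between two columns (the column indices are again hypothetical):

import numpy as np
import pandas as pd

# Running sums needed for Pearson's r: n, Σx, Σy, Σx², Σy², Σxy.
n = sx = sy = sxx = syy = sxy = 0.0

reader = pd.read_csv(
    "jq1pileup",
    sep="\t",
    header=None,
    usecols=[3, 7],               # hypothetical: the two columns to correlate
    dtype=np.int32,
    chunksize=5_000_000,
)

for chunk in reader:
    x = chunk.iloc[:, 0].to_numpy(dtype=np.float64)
    y = chunk.iloc[:, 1].to_numpy(dtype=np.float64)
    n += x.size
    sx += x.sum()
    sy += y.sum()
    sxx += (x * x).sum()
    syy += (y * y).sum()
    sxy += (x * y).sum()

r = (n * sxy - sx * sy) / np.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(r)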

0 Answers