I am trying to parse genomic data delivered in .bgen format into a Spark DataFrame using Hail. The file is 150 GB, so it won't fit in memory on a single node of my cluster.
I am wondering whether there is a streaming/lazy way to parse the data into my target format that doesn't require loading the whole file into memory up front.
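For context, here is a minimal sketch of roughly what I am attempting, assuming Hail's `hl.index_bgen` / `hl.import_bgen` entry points; the paths, entry fields, and sample file below are just placeholders:

```python
import hail as hl

# Initialize Hail on the existing Spark cluster
hl.init()

# Index the BGEN once (writes an index next to the file) -- path is illustrative
hl.index_bgen('data/genotypes.bgen')

# Import the BGEN as a MatrixTable; entry fields and sample file are placeholders
mt = hl.import_bgen('data/genotypes.bgen',
                    entry_fields=['GT', 'dosage'],
                    sample_file='data/genotypes.sample')

# Convert to a Spark DataFrame -- this is where I worry about memory pressure
df = mt.entries().to_spark()
```

Is this import already evaluated lazily/distributed across the cluster, or do I need a different approach to avoid materializing the whole file?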
I would really appreciate any inputs/ideas! Thanks a lot!