I am trying to parse genomic data delivered in .bgen format into a Spark DataFrame using Hail. The file is 150 GB, so it won't fit in memory on a single node of my cluster.
I am wondering whether there is a streaming/lazy way to parse the data into my target format that doesn't require loading the whole file into memory up front.
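For context, here is a minimal sketch of roughly what I am attempting, assuming Hail's `hl.index_bgen` / `hl.import_bgen` entry points; the paths, entry fields, and sample file below are just placeholders:

```python
import hail as hl

# Initialize Hail on the existing Spark cluster
hl.init()

# Index the BGEN once (writes an index next to the file) -- path is illustrative
hl.index_bgen('data/genotypes.bgen')

# Import the BGEN as a MatrixTable; entry fields and sample file are placeholders
mt = hl.import_bgen('data/genotypes.bgen',
                    entry_fields=['GT', 'dosage'],
                    sample_file='data/genotypes.sample')

# Convert to a Spark DataFrame -- this is where I worry about memory pressure
df = mt.entries().to_spark()
```

Is this import already evaluated lazily/distributed across the cluster, or do I need a different approach to avoid materializing the whole file?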
I would really appreciate any inputs/ideas! Thanks a lot!