I am trying to run a program that requires only pVCF files as input. Due to the size of the data, I am unable to create a separate directory containing the particular files that I need.

The directory contains multiple files with 'vcf.gz.tbi' and 'vcf.gz' endings. Using the following code:

file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"

I tried to create a file path that only grabs the '.vcf.gz' files while excluding the '.vcf.gz.tbi' files, but I have been unsuccessful.

Ava Wilson
  • Does it need to be a url, as opposed to an ordinary local file path? – Tom Karzes Jan 19 '23 at 19:18
  • I'm not sure if there is an alternative that I could use. I need a recursive object that feeds into this command: mt = hl.import_vcf(file_url, force_bgz=True, reference_genome="GRCh38", array_elements_required=False) – Ava Wilson Jan 19 '23 at 19:25
  • It's not a problem. I was just surprised it needed to be a url. If you have the file path, constructing a local url from it is easy. – Tom Karzes Jan 19 '23 at 20:30

1 Answer

The code you have, as written, is just assigning your file path to the variable file_url. For something like this, glob is popular but isn't the only option:

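import glob, os

# os.chdir needs a plain local path, not a "file://" URL
vcf_dir = "/mnt/projects/samples/vcf_format/"

os.chdir(vcf_dir)
# "*.vcf.gz" only matches names that end in ".vcf.gz", so ".vcf.gz.tbi" files are skipped
for file in glob.glob("*.vcf.gz"):
    print(file)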

Note that the file path doesn't name the kind of file you want (in this case, a gzipped VCF); the glob pattern in the for loop does that.
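To see why the pattern leaves the index files out, here is a minimal sketch using fnmatch, which applies the same shell-style matching rules as glob. The file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, for illustration only
names = ["a.vcf.gz", "a.vcf.gz.tbi", "b.vcf.gz", "notes.txt"]

# The pattern must match the whole name, so anything ending in ".tbi" is rejected
matched = [n for n in names if fnmatchcase(n, "*.vcf.gz")]
print(matched)  # ['a.vcf.gz', 'b.vcf.gz']
```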

Check out this answer for more options.

It took some digging but it looks like you're trying to use the import_vcf function of Hail. To put the files in a list so that it can be passed as input:

import glob, os

import hail as hl

# glob needs a plain local path, not a "file://" URL
vcf_dir = "/mnt/projects/samples/vcf_format"


def get_vcf_list(path):
    vcf_list = []
    # "*.vcf.gz" does not match names ending in ".vcf.gz.tbi"
    for file in sorted(glob.glob(os.path.join(path, "*.vcf.gz"))):
        # Add the "file://" scheme back for Hail
        vcf_list.append("file://" + file)
    return vcf_list


# Now you pass 'get_vcf_list(vcf_dir)' as your input instead of 'file_url'

mt = hl.import_vcf(get_vcf_list(vcf_dir), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)
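If you prefer pathlib, the same idea can be sketched as below. The helper name vcf_urls is hypothetical and not part of Hail; it only builds the list of file:// URLs from a local directory, and it assumes the VCFs sit directly in that directory (no subdirectories).

```python
from pathlib import Path


def vcf_urls(directory):
    """Return sorted file:// URLs for the *.vcf.gz files directly inside directory."""
    # as_uri() requires an absolute path, so resolve first
    root = Path(directory).resolve()
    # "*.vcf.gz" skips the ".vcf.gz.tbi" index files; is_file() skips directories
    return [p.as_uri() for p in sorted(root.glob("*.vcf.gz")) if p.is_file()]
```

The result should then be usable in place of the list above, e.g. hl.import_vcf(vcf_urls("/mnt/projects/samples/vcf_format"), ...), assuming your Hail setup accepts file:// URLs as in the question.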