0

I'm trying to figure out how to extract the read-group lane information from a fastq file, and then use this string within my GATK AddOrReplaceReadGroups Snakemake below (below). I've written a short Python function (at the top of the rule) to do this operation, but can't quite figure out how to actually use it in the script such that I can access the string output (e.g., "ABCHGSX35:2") of the function on a given fastq file in my GATK shell command. Basically, I want to feed {params.PathToReadR1} into the extract_lane_info function, but can't figure out how to integrate them correctly.

Open to any solutions to the problem posed, or if there's an entirely more efficiently and different way to achieve the same result (getting the read-group lane info as a param value), that'd be great, too. Thanks for any help.

def extract_lane_info(file_path):
    # Run the bash command using subprocess.run()
    elements = subprocess.run(["zcat", file_path], stdout=subprocess.PIPE).stdout.split(b"\n")[0].split(b":")[2:4]
    # Extract the lane information from the output of the command using a regular expression
    read_group = elements[0].decode().strip("'")
    string_after = elements[1].decode().strip("'")
    elements = read_group + ":" + string_after
    return(elements)  

rule AddOrReplaceReadGroups:
    input:
         "results/{species}/bwa/merged/{read}_pese.bam",
 
     output:
        "results/{species}/GATK/AddOrReplace/{read}.pese.bwa_mem.addRG.bam",

    threads: config["trimmmomatic"]["cpu"]
    
    log:
        "results/{species}/GATK/AddOrReplace/log/{read}_AddOrReplace.stderrs"
    
    message:
        "Running AddOrReplaceReadGroups on {wildcards.read}"

    conda: 
        config["CondaEnvs"]
    
    params:
        ReadGroupID = lambda wildcards: final_Prefix_ReadGroup[wildcards.read],
        PathToReadR1 = lambda wildcards: final_Prefix_RawReadPathR1[wildcards.read],
        LIBRARY = lambda wildcards: final_Prefix_ReadGroupLibrary[wildcards.read],
    
    shell:"gatk AddOrReplaceReadGroups -I {input} -O {output} -ID {params.ReadGroupID}.1 -LB {params.LIBRARY} -PL illumina -PU {input.lane_info}:{params.ReadGroupID} -SM {params.ReadGroupID} --SORT_ORDER 'coordinate' --CREATE_INDEX true 2>> {log}"

Basically, {params.PathToReadR1} would be "path/to/file.fastq.gz", and I want this file to be inputted into the extract_lane_info function, and then the output of this function to be used in the -PU section of the shell command (e.g., {params.lane_info}. I keep getting all types of errors as I've messed around with it, and am unsure how to move forward.

  • Assuming `final_Prefix_RawReadPathR1` is a dictionary assigned somewhere else in code not shown, it seems like under `params:` you'd add `lane_info = extract_lane_info`, and then fix ` shell` command to have `-PU {params.lane_info}`. Right now the shell command reads `-PU {input.lane_info}`. Then you need to change your function to take wildcards as that is how functions work, see [here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules). So `def extract_lane_info(file_path):` becomes `def extract_lane_info(wildcards):` & then... – Wayne Dec 23 '22 at 23:45
  • in your function add towards the start of it, the line `file_path = final_Prefix_RawReadPathR1[wildcards.read]` . I think that is it because the functions used in params have to take wildcards at the very least. There's some other arguments possible but those don't help in your case, I think. – Wayne Dec 23 '22 at 23:52
  • Thanks for the answer @Wayne! I've added the below code to try to resolve the issue, but now I get a "bus error" for whatever reason (presumably some memory issue when trying to initiate the Snakemake run). Any ideas? `extract_lane_info(wildcards): file_path = final_Prefix_RawReadPathR1[wildcards.read][0] elements = subprocess.run(["zcat", file_path], stdout=subprocess.PIPE).stdout.split(b"\n")[0].split(b":")[2:4] read_group = elements[0].decode().strip("'") string_after = elements[1].decode().strip("'") elements = read_group + ":" + string_after return elements` – Ari Miller Dec 28 '22 at 17:21
  • (This is at the very beginning of the Rule file, prior to specifying the input or the rule itself.) I've added your suggested revisions as well in the shell command, @Wayne – Ari Miller Dec 28 '22 at 17:27
  • Sorry, I've not come across a 'bus error' that I recall. – Wayne Dec 29 '22 at 04:02

0 Answers0