2

here's a tough one.

Problem Introduction:

I'm working with two different files: a GFF3, which is basically a "9 columns" TSV, and a FASTA, which is a text file. Now, importing the GFF3 file with gffpandas it looks like this:

                     seq_id  source  type     start       end score strand phase      attributes
1      ctg.s1.000000F_arrow  prokka  gene      56.0     244.0     .      +     .  NHDIEHOJ_00001
3      ctg.s1.000000F_arrow  prokka  gene     902.0    2167.0     .      -     .  NHDIEHOJ_00002
5      ctg.s1.000001F_arrow  prokka  gene    2363.0    2905.0     .      -     .  NHDIEHOJ_00003
7      ctg.s1.000003F_arrow  prokka  gene    2916.0    3947.0     .      -     .  NHDIEHOJ_00004
9      ctg.s2.000000F_arrow  prokka  gene    4353.0    5174.0     .      +     .  NHDIEHOJ_00005

Dropping the seq_id column I got the following "values":

ctg.s1.000000F_arrow
ctg.s1.000001F_arrow
ctg.s1.000003F_arrow
ctg.s2.000000F_arrow

Now let's step to the FASTA file, which looks like this:

>ctg.s1.000000F_arrow
CCGGAACATCGTCCTCATCCGCAAAGTCGAGCTCTGCCTCGATCATTGCACGCGAATGGGTCAGCCGTCGGGCCCAACCG
GCATAGAGTGCGGACTGTCCGCCACCGGACTGCTCTATGGCGAGACGACGCTGCATTTCCGTTTCTGCCGCAATCAGGTC
>ctg.s1.000001F_arrow
ACGCCGGCTGCAACTTTGAGAAGATGTGGCGATGTCGACCGCTGCATCCCGCCCTTCTCTGCAGAATTTTCCAGCTGTCC
GAGGACATTGGCAAAAAGGCCCTTGGAATCCTTGCGGCTCATTCTCTCCCCCATGCCTTCCAGAAGAGGCCCTCGAGTTC
>ctg.s1.000003F_arrow
GGCGCTGGTTTTCCCCGACACCTCGCCGCGCGGCGAGGGCGTGGCTGACGACGAGGCTTATGATCTCGGTCAGGGTGCGG
GCTTCTATGTCAATGCGACGCAGAAGCCCTGGTCGCCGCACTATCGCATGTATGATTATATCGTCACCGAATTGCCCGCC
>ctg.s2.000000F_arrow
GCGCTCGACGGCATGCCCGTACGCGGCCGATCCTGCGCCGCTTCCTTAACCTTAGCTGCGGATGGAAAGTCGTCCTCGGA
GTTCGGCTCGCAAACGCTTTCGAGCGCGCAATTGACGACGATGTCGTACCCAACTTAGATCGCCGAACGCCATGAGGTCG

Assuming that the uppercased text part is much longer than two lines, as you can see, the text part characterised by ">" symbol presents the same values of seq_id GFF3 column. As a matter of fact I wrote few line to assign to the FASTA file a dictionary in which the "key" is the text part characterised by ">" symbol, the "item" is the uppercased part.

Problem processing

For each attributes value inside the dataframe there's a corresponding start and end value which is an interval of the corresponding seq_id. I'd like to extract from the FASTA file that exact interval but with respect to the seq_id value which refers to. I mean the the interval 56-244 must be searched only for the FASTA sequence of ctg.s1.000000F_arrow, as well as 902-2167. The final goal is to obtain a dataframe which has an additional 10th column (es. 'sequence') that contains the corresponding FASTA sequence of the interval, like this:

                     seq_id  source  type     start       end score strand phase      attributes    sequence
1      ctg.s1.000000F_arrow  prokka  gene      56.0     244.0     .      +     .  NHDIEHOJ_00001    CCGGAACATCGTCCTCATCCG
3      ctg.s1.000000F_arrow  prokka  gene     902.0    2167.0     .      -     .  NHDIEHOJ_00002    CAAGGACATCGTGATCAATTC
5      ctg.s1.000001F_arrow  prokka  gene    2363.0    2905.0     .      -     .  NHDIEHOJ_00003    TCGCCGCGCGGCGAGTGATTA
7      ctg.s1.000003F_arrow  prokka  gene    2916.0    3947.0     .      -     .  NHDIEHOJ_00004    TCGAGCGCGCAATTGACGACG
9      ctg.s2.000000F_arrow  prokka  gene    4353.0    5174.0     .      +     .  NHDIEHOJ_00005    AGATCGCCGAACGCCATATTT

N.B. The sequences in sequence have been randomly typed of the same length but will differ proportionally to the end - start dimension for each attributes value.

I hope I was clear in the explanation. Thank you everybody for the help.

  • So to be clear, you already have the dictionary with the sequence ID as key and the nucleotides as value, and you want to insert in your DataFrame the matching sequence sliced between start and end? If the sequence is `AAACCCGGGTTT` and start/end is 3/7 you would want `ACCCT`? – mozway Oct 13 '22 at 22:01
  • Yes, precisely! – Iacopo Passeri Oct 14 '22 at 07:35

1 Answers1

2

Assuming df the DataFrame and dic the dictionary and the sequence indexing to be starting at 1 (not 0 like python indexing):

df['sequence'] = [dic[k][int(i-1):int(j)] for k, i, j in
                  zip(df['seq_id'], df['start'], df['end'])]
mozway
  • 194,879
  • 13
  • 39
  • 75