I am new to Python and want to use it to extract the text between two matching patterns on each line of my tab-delimited text file (mydata.txt).

mydata.txt:

Sequence                                                                                                            tRNA    Bounds  tRNA    Anti    Intron Bounds   Cove
Name                                                                                                            tRNA #  Begin   End Type    Codon   Begin   End Score
--------                                                                                                        ------  ----    ------  ----    -----   -----   ----    ------
lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33                                                 1   1   73  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33                                                 1   1   72  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_62[locus_tag=SS1G_20127][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1

Code I tried:

lines = [] #Declare an empty list named "lines"
with open('/media/owner/c3c5fbb4-73f6-45dc-a475-988ad914056e/phasing/trna/test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        # print(line)
        if line.strip() == "locus_tag=":  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == "][db":
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)

I want to grab the text between [locus_tag= and ][db_xref on each line and get these results:

SS1G_20133
SS1G_20131
SS1G_20130
SS1G_20129
SS1G_20127
MAPK

2 Answers

If I'm understanding correctly, this should work for a given line of your data:

data = line.split("locus_tag=")[1].split("][db_xref")[0]

The idea is to split the string on locus_tag=, take the 2nd element, then split that string on ][db_xref and take the first element.
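
To make the two splits concrete, here is a small sketch that walks through the intermediate values for a single line. The sample string is copied from the first data row in the question; the variable names sample, after_tag and locus are only for illustration.

```python
# One data row from the question (only the first column is needed here).
sample = "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33"

# First split: element [1] is everything after "locus_tag=".
after_tag = sample.split("locus_tag=")[1]

# Second split: element [0] is everything before "][db_xref".
locus = after_tag.split("][db_xref")[0]

print(after_tag)
print(locus)
```

Note that indexing with [1] raises an IndexError on lines that do not contain locus_tag= (such as the header rows), which is why a membership check before splitting is useful.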

If you want help with the outer loop it could look like:

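# file_path is the path to your data file
with open(file_path) as input_data:  # the with block closes the file when done
    for line in input_data:
        # Header rows have no locus tag, so skip them
        if "locus_tag=" in line:
            data = line.split("locus_tag=")[1].split("][db_xref")[0]
            print(data)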
derricw

You can use re.search with positive lookbehind and positive lookahead patterns:

import re
...
for line in input_data:
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        print(match.group())
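
As a possible variation, the same extraction can be sketched with a capture group instead of lookarounds, collecting the results in a list rather than printing them. This is only an illustration: the helper name extract_locus_tags and the shortened sample rows are assumptions, not part of the original answer, and the pattern uses a non-greedy .*? so it stops at the first ][db_xref.

```python
import re

# Capture whatever sits between "[locus_tag=" and "][db_xref" (non-greedy).
LOCUS_TAG = re.compile(r"\[locus_tag=(.*?)\]\[db_xref")


def extract_locus_tags(lines):
    """Return the locus tag from every line that contains one (hypothetical helper)."""
    tags = []
    for line in lines:
        match = LOCUS_TAG.search(line)
        if match:  # header and separator rows simply do not match
            tags.append(match.group(1))
    return tags


# Shortened sample rows based on the data in the question.
sample_lines = [
    "Name    tRNA #  Begin   End Type    Codon   Begin   End Score",
    "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33    1   1   71  Pseudo",
    "lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33    1   1   73  Pseudo",
]

print(extract_locus_tags(sample_lines))
```

Because an open file object is also an iterable of lines, the same helper could be passed input_data directly.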
blhsing