Problems with getting the needed lines with regex

Question

I am new to regular expressions in Python. I have a file from which I need to extract the amino acid sequences. Here is a quick look at the file:

>nxp:NX_A0A0A6YYD4-1 \PName=T cell receptor beta variable 13 isoform Iso 1 \GName=TRBV13 \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=124 \SV=5 \EV=31 \PE=3 \ModResPsi=(52|MOD:00798|half cystine)(120|MOD:00798|half cystine) \ModRes=(106||N-linked (GlcNAc...) asparagine) \VariantSimple=(18|H)(27|V) \Processed=(1|31|PEFF:0001021|signal peptide)(32|124|PEFF:0001020|mature protein) MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC ASSL

>nxp:NX_A0A1B0GV90-1 \PName=Cortexin domain containing 2 isoform Iso 1 \GName=CTXND2 \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=55 \SV=1 \EV=11 \PE=3 \VariantSimple=(13|N)(22|F)(29|T)(34|Q)(45|T) \Processed=(1|55|PEFF:0001020|mature protein) MEDSSLSSGVDVDKGFAIAFVVLLFLFLIVMIFRCAKLVKNPYKASSTTTEPSLS

I could work out where the part I need starts and where it ends. As you can see in my code, it begins after the coordinates that follow \VariantSimple, and it should end at the next ">" character. But I cannot find a way to extract the MLSPDLPD..... sequence itself. Could someone give me an idea?

with open('PATH/XYZ', 'r') as f:
    data = f.read()

import regex

h = regex.compile("(.*)\n").match(data)
header = h.group(1)
start = regex.match(".+\\\VariantSimple=(\([^)]+\))*\s{0,1}", data)
start.captures(1)
end = regex.compile("(.)*\>").match(data)
end.captures(0)

You should be able to match the thing you actually want in the first place... or do you need other things from the first match? If you can show _exactly_ what you want to get out at the end, that would help a lot. Is it just the long string of uppercase chars? — Matt Hall, Oct 11 '19 at 13:06
In this case I need the sequence MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC ASSL — menbar, Oct 11 '19 at 13:10
Have you checked whether there's already a library to parse and grab information from that file type? It seems to be something related to genomics. — Diego Allen, Oct 11 '19 at 13:15
Yes, I need it for my project in the proteomics field. Sadly I don't know of any packages or functions that parse the sequence. — menbar, Oct 11 '19 at 13:19
I guess it's a simple FASTA file containing headers and amino acid sequences? Have a look at the Biopython package; you will find lots of useful methods for handling those files. — voiDnyx, Oct 11 '19 at 13:24
If the last value is always after a closing `)` you could match `VariantSimple=` in the string and then match until the last occurrence of `)` and capture the rest in group 1 https://regex101.com/r/OhYaff/1 `^>nxp:.*\\VariantSimple=.*\)(.*)` — The fourth bird, Oct 11 '19 at 18:07
Please remember to accept one of the answers if it addresses your question. Welcome to StackOverflow! — Nick Reed, Oct 14 '19 at 13:42
Looks like you are parsing PEFF files. `pyteomics` has a [module](https://pyteomics.readthedocs.io/en/latest/api/peff.html) for that. — Lev Levitsky, Dec 01 '19 at 15:26

Matt Hall · Answer 1 · 2019-10-11T13:31:52.397

In general there are 3 ways to parse data like this:

1. Use str methods.
2. Use regex, e.g. with re.
3. Use a parser, e.g. parsimonious.

Here's a really fantastic answer on SO about using regex and parsers.

They are all fiddly. But string methods are easy to debug, and regex and parsers... aren't. So my first move would be to try to unpack the data with string methods, maybe like this:

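items = {}

# Every record starts with '>'. Split on that first; otherwise the sequence
# of one record and the header of the next land in the same chunk.
# (This assumes '>' never appears inside a header field.)
for record in data.split('>'):
    if not record.strip():
        continue

    # First chunk is the header ('nxp:NX_...'); the rest are key=value fields.
    header, *fields = record.split('\\')
    nxp = header.strip().split(':')[-1]

    this_item = {}
    for field in fields:
        if '=' in field:
            key, value = field.strip().split('=', 1)
            this_item[key] = value
    items[nxp] = this_item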

This results in a dictionary of data:

{'NX_A0A0A6YYD4-1': {'PName': 'T cell receptor beta variable 13 isoform Iso 1',
  'GName': 'TRBV13',
  'NcbiTaxId': '9606',
  'TaxName': 'Homo Sapiens',
  'Length': '124',
  'SV': '5',
  'EV': '31',
  'PE': '3',
  'ModResPsi': '(52|MOD:00798|half cystine)(120|MOD:00798|half cystine)',
  'ModRes': '(106||N-linked (GlcNAc...) asparagine)',
  'VariantSimple': '(18|H)(27|V)',
  'Processed': '(1|31|PEFF:0001021|signal peptide)(32|124|PEFF:0001020|mature protein) MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC ASSL'},
 'NX_A0A1B0GV90-1': {'PName': 'Cortexin domain containing 2 isoform Iso 1',
  'GName': 'CTXND2',
  'NcbiTaxId': '9606',
  'TaxName': 'Homo Sapiens',
  'Length': '55',
  'SV': '1',
  'EV': '11',
  'PE': '3',
  'VariantSimple': '(13|N)(22|F)(29|T)(34|Q)(45|T)',
  'Processed': '(1|55|PEFF:0001020|mature protein) MEDSSLSSGVDVDKGFAIAFVVLLFLFLIVMIFRCAKLVKNPYKASSTTTEPSLS'}}

And this feels easier to wield.

Now we can go after that sequence of characters, and maybe this isn't too hard and we can think about using regex without getting a headache, e.g.:

import re
re.search(r'\) ([ A-Z]+)', items['NX_A0A0A6YYD4-1']['Processed']).groups()[0]

This gives:

'MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC ASSL'
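The search above handles one record. To collect the sequence of every record, the same idea can be run in a loop. This is only a sketch: `extract_sequence` is a hypothetical helper name, the `items` dictionary is cut down to the 'Processed' field of the two records shown above, and it assumes the whitespace inside a sequence is not wanted.

```python
import re

# Cut-down stand-in for the `items` dictionary above: only 'Processed' is kept,
# because that is the field the sequence ends up in.
items = {
    'NX_A0A0A6YYD4-1': {
        'Processed': '(1|31|PEFF:0001021|signal peptide)(32|124|PEFF:0001020|mature protein) '
                     'MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT '
                     'VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC '
                     'ASSL',
    },
    'NX_A0A1B0GV90-1': {
        'Processed': '(1|55|PEFF:0001020|mature protein) '
                     'MEDSSLSSGVDVDKGFAIAFVVLLFLFLIVMIFRCAKLVKNPYKASSTTTEPSLS',
    },
}


def extract_sequence(processed):
    """Return the trailing run of uppercase letters that follows a ')',
    with all whitespace removed, or None if there is no such run."""
    match = re.search(r'\)\s+([A-Z\s]+)$', processed)
    if match is None:
        return None
    return ''.join(match.group(1).split())


# Map each accession to its cleaned-up sequence.
sequences = {nxp: extract_sequence(fields['Processed'])
             for nxp, fields in items.items()}

for nxp, seq in sequences.items():
    print(nxp, len(seq), seq[:10])
```

Comparing `len(seq)` with the record's `Length` field is a cheap sanity check that nothing was cut off.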

@menbar Don't forget to upvote things you find useful :) — Matt Hall, Oct 14 '19 at 08:14

Nick Reed · Accepted Answer · 2019-10-11T13:29:37.407

\\VariantSimple=((?:\([^\)]+\))*) \\Processed=((?:\([^\)]+\))*) ([\s\S]*?)(?:\n*>|$)

This regex will capture your amino acid sequence. After closing out the "Processed" data field(s), it captures all characters across lines until it reaches a > character (optionally preceded by newlines) or the end of the input. This should be adaptable to your Python code.

Regex demo

Example code would look something like this; it finds every amino acid sequence it can match and prints each one.

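import re

with open('data.txt', 'r') as fil:
    data = fil.read()

# Raw string, so the backslashes reach the regex engine unchanged.
# The only capture group is the sequence that follows \Processed=(...).
rex = re.compile(r"\\VariantSimple=(?:\([^\)]+\))* \\Processed=(?:\([^\)]+\))* ([\s\S]*?)(?:\n*>|$)")

# With a single capture group, findall returns one string per match.
out = rex.findall(data)

for mtch in out:
    print(mtch + "\n")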

Output:

MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC ASSL

MEDSSLSSGVDVDKGFAIAFVVLLFLFLIVMIFRCAKLVKNPYKASSTTTEPSLS

Python demo
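
The pattern above returns the sequences without their identifiers. If the accession is needed alongside each sequence, a variation on the pattern can capture both. This is a sketch under assumptions, not a tested parser for the whole file: `RECORD_RE` and `parse_records` are hypothetical names, the sample headers are shortened to a few fields, the sequence is assumed to sit on the lines after the header, and no `>` is assumed to appear inside a header.

```python
import re

# Sample input: the two records from the question with shortened headers.
# The line breaks inside the sequences are an assumption about the real file.
data = (
    ">nxp:NX_A0A0A6YYD4-1 \\GName=TRBV13 \\VariantSimple=(18|H)(27|V) "
    "\\Processed=(1|31|PEFF:0001021|signal peptide)(32|124|PEFF:0001020|mature protein)\n"
    "MLSPDLPDSAWNTRLLCRVMLCLLGAGSVAAGVIQSPRHLIKEKRETATLKCYPIPRHDT\n"
    "VYWYQQGPGQDPQFLISFYEKMQSDKGSIPDRFSAQQFSDYHSELNMSSLELGDSALYFC\n"
    "ASSL\n"
    ">nxp:NX_A0A1B0GV90-1 \\GName=CTXND2 \\VariantSimple=(13|N)(22|F)(29|T)(34|Q)(45|T) "
    "\\Processed=(1|55|PEFF:0001020|mature protein)\n"
    "MEDSSLSSGVDVDKGFAIAFVVLLFLFLIVMIFRCAKLVKNPYKASSTTTEPSLS\n"
)

RECORD_RE = re.compile(
    r">nxp:(?P<accession>\S+)"            # record identifier after '>nxp:'
    r"[^>]*?\\Processed=(?:\([^)]+\))*"   # skip header fields up to and including \Processed=(...)
    r"\s+(?P<sequence>[A-Z\s]+?)"         # residues, possibly spread over several lines
    r"\s*(?=>|$)"                         # stop before the next '>' or at the end of the input
)


def parse_records(text):
    """Map each accession to its sequence with all whitespace removed."""
    return {
        m.group("accession"): "".join(m.group("sequence").split())
        for m in RECORD_RE.finditer(text)
    }


records = parse_records(data)
for accession, sequence in records.items():
    print(accession, len(sequence))
```

The lookahead `(?=>|$)` is used instead of consuming the `>`, so the next record's header is still available for the following match.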
