I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:

[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]    
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]  
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]

The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?

potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
    status=False
    uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
        status=True
    return status
correctIDs=[]
for prot in potential_uniprots:
    if is_uniprot(prot) == True:
        correctIDs.append(prot)
print(correctIDs)
Wiktor Stribiżew
Sarah
  • Does this answer your question? [Regex to match words of a certain length](https://stackoverflow.com/questions/9043820/regex-to-match-words-of-a-certain-length) – Marcin Mrugas Nov 30 '20 at 10:02
  • Does this answer your question? https://stackoverflow.com/questions/4007302/regex-how-to-match-an-optional-character – Marcin Mrugas Nov 30 '20 at 10:03
  • You can omit `{1}` and the comma's from the character class (If you don't want to match comma's) The patterns by them selves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact amount of characters. To match an optional hyphen and digit, you can use an optional non capturing group `(?:-[0-9])?` – The fourth bird Nov 30 '20 at 10:10
  • @Thefourthbird please see my answer. I pointed to your comment as credit for the "Expression Fixes" section of my answer. Please let me know if you would like me to change anything or delete the section as a whole. – gmdev Nov 30 '20 at 13:00

1 Answer

Expression Fixes:

BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:

You can omit {1} and the comma's from the character class (If you don't want to match comma's) The patterns by them selves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact amount of characters. To match an optional hyphen and digit, you can use an optional non capturing group (?:-[0-9])?

You don't need the , separating the characters in the square brackets: a character class matches any one of the characters listed inside it, so a comma in the class is treated as one more character to match. For example, [A-Z,0-9] matches an uppercase letter, a comma, or a digit, whereas [A-Z0-9] matches only an uppercase letter or a digit. You also don't need the {1}, because every element of a pattern matches exactly once when no quantifier is given. You can simply delete each {1} from the expression.
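As a quick check of the two points above, here is a minimal sketch (the names with_comma and without_comma are only for illustration) showing that the comma really is matched as a literal character and that {1} changes nothing:

```python
import re

# Character class copied from the question: the comma is a literal member.
with_comma = re.compile(r"[A-Z,0-9]")
# Same class with the commas removed.
without_comma = re.compile(r"[A-Z0-9]")

print(bool(with_comma.fullmatch(",")))     # True: a lone comma is accepted
print(bool(without_comma.fullmatch(",")))  # False: the comma is rejected

# {1} is redundant: both patterns accept exactly one digit.
print(bool(re.fullmatch(r"[0-9]{1}", "7")), bool(re.fullmatch(r"[0-9]", "7")))  # True True
print(bool(re.fullmatch(r"[0-9]{1}", "77")), bool(re.fullmatch(r"[0-9]", "77")))  # False False
```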

Checking Length?

If you only need to check the length, there is a simple way to do it without regex:

string = "Q08F88"
# The formats in the question are 6 or 10 characters long (before any "-N" suffix)
status = len(string) in (6, 10)

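But you can also make the regex itself enforce the length. The \b (word boundary) you already use is not quite enough for that: a hyphen is a non-word character, so \b also matches between the 5 and the - in C3KQY5-V, and a \b-delimited pattern can still find a six-character ID at the start of that string. Instead, use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.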

Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:

abcd

And not:

eabcd
abcde

This is because ^ denotes the start of the string and $ denotes the end of the string.
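To see what this means for IDs like the ones in the question, the sketch below compares a \b-delimited pattern with an anchored one. The names core, word_bounded and anchored are illustrative only. It also shows re.fullmatch, a standard-library alternative that requires the whole string to match without writing the anchors yourself.

```python
import re

# Six-character core of the first format, without commas or {1}.
core = r"[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"

word_bounded = re.compile(r"\b" + core + r"\b")
anchored = re.compile(r"^" + core + r"$")

# The hyphen counts as a word boundary, so \b lets the bad suffix through.
print(bool(word_bounded.search("Q08F88-V")))  # True
print(bool(anchored.search("Q08F88-V")))      # False
print(bool(anchored.search("Q08F88")))        # True

# Caveat: in Python, $ also matches just before a trailing newline.
print(bool(anchored.search("Q08F88\n")))      # True
# fullmatch has no such exception.
print(bool(re.fullmatch(core, "Q08F88\n")))   # False
```

If the IDs come from a file and may carry a trailing newline, fullmatch (or stripping the line first) is the stricter choice.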

In the end, you're left with this first expression:

(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)

You can modify your other expressions easily as they follow the same structure as above.
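For reference, here is one way the three formats from the question could be combined into a single check. This is a sketch rather than the only possible layout: the names SUFFIX, CORES and UNIPROT_RE are made up for this example, the suffix follows the single-digit (?:-[0-9])? form used above, and fullmatch is used in place of explicit ^ and $.

```python
import re

# Optional "-digit" suffix, as in the expression above.
SUFFIX = r"(?:-[0-9])?"

# The three formats listed in the question, commas and {1} removed.
CORES = (
    r"[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]",
    r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]",
    r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]",
)

# One alternation over the three cores, followed by the optional suffix.
UNIPROT_RE = re.compile("(?:" + "|".join(CORES) + ")" + SUFFIX)


def is_uniprot(candidate):
    """Return True if the whole string matches one of the three formats."""
    return UNIPROT_RE.fullmatch(candidate) is not None


potential_uniprots = ['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25',
                      'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
correct_ids = [prot for prot in potential_uniprots if is_uniprot(prot)]
print(correct_ids)  # ['D4S359N116-2', 'Y6IT25', 'V5PG90']
```

If the number after the dash may have more than one digit, changing the suffix to (?:-[0-9]+)? would allow that; the question does not say which is intended.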

Code Suggestions

Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:

if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
   status=True
return status

To this:

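return bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))

# -OR-

status = bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
return status

Note that search() returns a match object or None, not True or False, so the or-chain on its own evaluates to a match object or None. That is fine in a plain if test (a match object is truthy and None is falsy), but it would break the is_uniprot(prot) == True comparison in your loop, because a match object is not equal to True. Wrapping the expression in bool() keeps the function returning a real True or False.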
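A small sketch of what search() hands back may help here. The pattern is the first expression from above; the names hit and miss are illustrative.

```python
import re

pattern = re.compile(r"^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$")

hit = pattern.search("Q08F88")   # a match object
miss = pattern.search("Q08F8")   # None, the string is one character short

print(hit is not None, miss is None)  # True True

# A match object is truthy, but it is not equal to True.
print(hit == True)                    # False

# bool() converts either result into a real boolean.
print(bool(hit), bool(miss))          # True False
```

For the same reason, writing the loop test as if is_uniprot(prot): rather than comparing against True works with either return style.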

gmdev