1

I have user input for a DNA sequence analysis program, and I want to check that the input is in fact a sequence, i.e. that it contains only A, C, T, G or a, c, t, g.

I have thought of using a regular expression, where a re.search match would count as True for a correctly formatted input. If the result is False, I can ask for the input again, and so on. Like so:

input = "ATGGCAAT"
>>True

input = "atg"
>>True

input = "AATG!4"
>>False

input = "this input contains all the char but is in the wrong format"
>>False

I have also considered using a negative lookahead that would match anything other than the correct format.

Dharman

3 Answers

2

You need to check that the string contains only the letters A, C, T, G, in lower or upper case, so you anchor the expression at the start and the end of the line:

import re
# Match object if the whole string is A/C/T/G in either case, otherwise None
re.match(r"(?i)^[ACTG]+$", input)
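A minimal sketch of the same check wrapped in a helper and run against the four inputs from the question. The names `DNA_PATTERN` and `is_dna_sequence` are made up for illustration, and the sketch swaps the anchored `re.match` for `re.fullmatch`, because `$` in Python also matches just before a trailing newline:

```python
import re

# Hypothetical names, not part of the original answer.
DNA_PATTERN = re.compile(r"[ACGT]+", re.IGNORECASE)


def is_dna_sequence(text):
    """Return True if text is non-empty and made up only of A, C, G, T (any case)."""
    # fullmatch succeeds only when the pattern covers the entire string.
    return DNA_PATTERN.fullmatch(text) is not None


examples = [
    "ATGGCAAT",
    "atg",
    "AATG!4",
    "this input contains all the char but is in the wrong format",
]
for candidate in examples:
    print(repr(candidate), is_dna_sequence(candidate))
```

The `is not None` comparison turns the match object into the plain True/False the question asks for.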
Alex Sveshnikov
0

You can use the start-of-string and end-of-string anchors, and then specify the characters you want, one or more times, like so:

^[actgACTG]+$

You can try it against your examples here: https://regex101.com/r/CgiTEL/1
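As a rough sketch of how this pattern could drive the "ask again" loop described in the question: the function name `ask_for_sequence`, the `read` parameter, the prompt text, and the decision to strip surrounding whitespace are all assumptions made for this example.

```python
import re

VALID_SEQUENCE = re.compile(r"^[actgACTG]+$")


def ask_for_sequence(read=input):
    """Keep asking until the reply matches the pattern, then return it.

    `read` is a hypothetical hook so the prompt can be swapped out in tests;
    by default it is the built-in input().
    """
    while True:
        # Assumption: leading and trailing whitespace is not significant.
        reply = read("Enter a DNA sequence: ").strip()
        if VALID_SEQUENCE.match(reply):
            return reply
        print("Only the letters A, C, T and G are allowed, please try again.")
```

Calling `ask_for_sequence()` with no arguments reads from the keyboard; passing a different `read` callable makes the loop easy to exercise without typing.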

João Amaro
0

Non-regex solution. This function checks the string and returns False if any character is not in your designated list; otherwise it returns True.

def test(string_input):
    for s in string_input:  # loop through each character in the string
        if s.lower() not in ["a", "c", "t", "g"]:  # lower() to change s to lowercase
            return False
    else:  # the loop finished without returning, so every character passed
        return True


string_input = "AATG!4"
test(string_input)
>> False
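One possible variant, sketched with set operations instead of an explicit loop. The names `VALID_BASES` and `is_dna` are invented for this example. It also rejects the empty string, which the loop version accepts because its `else` branch runs when there is nothing to iterate over; whether empty input should count as valid is an assumption here.

```python
VALID_BASES = set("ACGT")  # hypothetical name for the allowed letters


def is_dna(string_input):
    """True only for a non-empty string whose characters are all A, C, G or T (any case)."""
    # set(...) <= VALID_BASES checks that every distinct character is allowed.
    return bool(string_input) and set(string_input.upper()) <= VALID_BASES


print(is_dna("AATG!4"))
print(is_dna("atg"))
```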
Meshi