String may only contain A, U, G or C

Question

Forgive the simplistic question, but I've read through the SO questions and the Python documentation and still haven't been able to figure this out.

How can I create a Python regex to test whether a string contains ANY but ONLY the A, U, G and C characters? The string can contain either one or all of those characters, but if it contains any other characters, I'd like the regex to fail.

I tried:

>>> re.match(r"[AUGC]", "AUGGAC")
<_sre.SRE_Match object at 0x104ca1850>

But adding an X on to the end of the string still works, which is not what I expected:

>>> re.match(r"[AUGC]", "AUGGACX")
<_sre.SRE_Match object at 0x104ca1850>

Thanks in advance.

your regex only checks the first character.... – Gryphius May 13 '14 at 19:31
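To see what this comment means, it helps to look at what the match actually consumed. A minimal sketch (the variable names are only for illustration):

```python
import re

# re.match succeeds as soon as the pattern matches at the start of the string;
# a bare character class consumes exactly one character and then stops.
m = re.match(r"[AUGC]", "AUGGACX")
matched_text = m.group()   # the text the match actually consumed
matched_span = m.span()    # (start, end) offsets of that text
print(matched_text, matched_span)
```

The trailing X is never examined, which is why the match still succeeds.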

score 5 · Accepted Answer · answered May 13 '14 at 19:31

You need the regex to consume the whole string (or fail, if it can't). re.match implicitly anchors the pattern at the start of the string, so you only need to add an anchor at the end:

re.match(r"[AUGC]+$", string_to_check)

Also note the +, which matches your character set one or more times (since, again, the point is to consume the whole string).
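One caveat worth knowing: in Python, $ also matches just before a trailing newline. If that matters for your input, re.fullmatch (available in Python 3.4 and later) is an alternative. A rough sketch, where is_rna is a hypothetical helper name and not something from the question:

```python
import re

def is_rna(seq):
    """Return True if seq is non-empty and contains only A, U, G or C."""
    # fullmatch requires the pattern to span the entire string,
    # so no explicit anchors are needed.
    return re.fullmatch(r"[AUGC]+", seq) is not None

# `$` matches at the end of the string OR just before a final newline,
# so the anchored re.match form accepts "AUG\n".
dollar_accepts_newline = re.match(r"[AUGC]+$", "AUG\n") is not None
```

With this sketch, is_rna("AUG\n") is rejected while the `$` version accepts it.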

Erik Kaplun · Answer 2 · 2014-05-13T19:43:48.363

2

Use ^[AUCG]*$; this will match against the entire string.

Or, if there has to be at least one letter, ^[AUCG]+$ — ^ and $ stand for beginning of string and end of string respectively; * and + stand for zero or more and one or more respectively.

This is purely about regular expressions and not specific to Python really.

edited May 13 '14 at 19:43

answered May 13 '14 at 19:32

Erik Kaplun

score 2 · Answer 3 · answered May 13 '14 at 19:33

If those are the only characters allowed in the string, you can do the following:

>>> r = re.compile(r'^[AUGC]+$')
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
>>>

Then, if you want your regex to match the empty string as well, you can do:

>>> r = re.compile(r'^[AUGC]*$')
>>> r.match("")
<_sre.SRE_Match object at 0x10ee16718>
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")

Here's a description of what the first regexp does:

[Figure: regular expression visualization]

score 1 · Answer 4 · answered May 13 '14 at 19:32

You are actually really close. What you have just tests for a single character that is A, U, G or C.

What you want is to match a string that has one or more letters that are all A, U, G or C. You can accomplish this by adding the plus quantifier to your regular expression.

re.match(r"^[AUGC]+$", "AUGGAC")

Additionally, adding $ at the end anchors the match to the end of the string. You can optionally use ^ at the front to match the beginning of the string, although re.match already starts matching there.
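Whether the leading ^ matters depends on which function you call. A small sketch comparing re.match and re.search on the same input (the variable names are just for illustration):

```python
import re

# re.match only tries the pattern at position 0, so a leading ^ is redundant:
# "X" is not in the set, and the match fails immediately.
match_no_caret = re.match(r"[AUGC]+$", "XAUG")

# re.search scans forward, so without ^ it finds the valid suffix "AUG".
search_no_caret = re.search(r"[AUGC]+$", "XAUG")

# With both anchors, re.search rejects the string as well.
search_with_caret = re.search(r"^[AUGC]+$", "XAUG")
```

So if you ever switch from re.match to re.search, keep the ^ in the pattern.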

score 1 · Answer 5 · answered May 13 '14 at 19:33

Just check to see if there is anything other than "AUGC" in there:

if re.search('[^AUGC]', string_to_check):
    pass  # fail: a character other than A, U, G or C was found

You can add a check to make sure the string is not empty in the same statement:

if not string_to_check or re.search('[^AUGC]', string_to_check):
    pass  # fail: the string is empty or contains another character
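Wrapped up as a function, the same idea might look like the sketch below. The names INVALID and is_valid are made up for this example:

```python
import re

# Any single character outside the allowed set.
INVALID = re.compile(r"[^AUGC]")

def is_valid(string_to_check):
    # Reject empty strings and strings containing a disallowed character.
    if not string_to_check or INVALID.search(string_to_check):
        return False
    return True
```

Because a negated character class also matches a newline, this version rejects "AUG\n" too.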

score 1 · Answer 6 · answered May 13 '14 at 19:36

No real need to use a regex:

>>> good = 'AUGGCUA'
>>> bad = 'AUGHACUA'
>>> all([c in 'AUGC' for c in good])
True
>>> all([c in 'AUGC' for c in bad])
False
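A generator expression works here as well and lets all() stop at the first bad character instead of building the whole list. Note that all() of an empty iterable is True, so an empty string passes unless you guard against it. A sketch with hypothetical function names:

```python
def only_augc(seq):
    # all() over a generator stops at the first disallowed character.
    return all(c in "AUGC" for c in seq)

def only_augc_nonempty(seq):
    # all() of an empty iterable is True, so check for "" explicitly.
    return bool(seq) and only_augc(seq)
```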

answered May 13 '14 at 19:36

Blair

No real need *not* to use them, either. They are the perfect job for this task. – Konrad Rudolph May 13 '14 at 19:37
@KonradRudolph - sure they work. But, to me anyway, it's easier to see what is being checked in my code than having to parse the quantifiers, start and end of string markers etc in the regex solutions posted here. Each to their own though :). – Blair May 13 '14 at 19:45

Tom Fenech · Answer 7 · 2014-05-13T19:45:18.933

1

I know you're asking about regular expressions, but I thought it was worth mentioning set. To establish whether your string only contains A, U, G or C, you could do this:

>>> input = "AUCGCUAGCGAU"
>>> s = set("AUGC")
>>> set(input) <= s
True
>>> bad = "ASNMSA"
>>> set(bad) <= s
False

edit: thanks to @roippi for spotting my mistake, <= should be used, not ==.

Instead of using <=, the method issubset can be used:

>>> set("AUGAUG").issubset(s)
True

If all characters in the string input are in the set s, then issubset will return True.
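The check can also be turned around: issuperset accepts any iterable, so the string does not have to be converted to a set first. A minimal sketch, with ALLOWED and only_allowed as made-up names:

```python
ALLOWED = frozenset("AUGC")

def only_allowed(seq):
    # issuperset accepts any iterable, so the string is checked directly.
    return ALLOWED.issuperset(seq)

# Why == is the wrong test: a valid string need not use every letter.
equality_rejects_valid = set("AUGAUG") == ALLOWED
```

As with the other non-regex approaches, an empty string counts as valid here, since the empty set is a subset of every set.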

edited May 13 '14 at 19:45

answered May 13 '14 at 19:38

Tom Fenech

sets are a good idea, but your implementation is not quite right. `AUGAUG` should match, not fail. – roippi May 13 '14 at 19:40
@roippi good point, I've corrected it. – Tom Fenech May 13 '14 at 19:43

score 0 · Answer 8 · answered May 13 '14 at 19:41

From: https://docs.python.org/2/library/re.html

Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.

So you could do [^AUGC] and if it matches that then reject it, else keep it.
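If you also want to know which characters caused the rejection, the complemented set can be combined with re.finditer. This is only a sketch, and invalid_characters is a hypothetical helper name:

```python
import re

def invalid_characters(seq):
    """Return (index, character) pairs for everything outside A, U, G, C."""
    return [(m.start(), m.group()) for m in re.finditer(r"[^AUGC]", seq)]
```

An empty result means the string contains nothing but A, U, G and C.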
