Forgive the simplistic question, but I've read through the SO questions and the Python documentation and still haven't been able to figure this out.

How can I create a Python regex to test whether a string contains ANY but ONLY the A, U, G and C characters? The string can contain either one or all of those characters, but if it contains any other characters, I'd like the regex to fail.

I tried:

>>> re.match(r"[AUGC]", "AUGGAC")
<_sre.SRE_Match object at 0x104ca1850>

But adding an X on to the end of the string still works, which is not what I expected:

>>> re.match(r"[AUGC]", "AUGGACX")
<_sre.SRE_Match object at 0x104ca1850>

Thanks in advance.

Konrad Rudolph
Hakan B.

8 Answers

You need the regex to consume the whole string (or fail, if it can't). re.match only matches at the start of the string, so the start is already anchored; you need to add an anchor at the end:

re.match(r"[AUGC]+$", string_to_check)

Also note the +, which matches your character set one or more times (since, again, the point is to consume the whole string).
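One caveat worth knowing: $ also matches just before a trailing newline, so a pattern ending in $ accepts "AUGGAC\n". The following is a minimal sketch, assuming Python 3.4 or later (where re.fullmatch is available). The helper name is_rna is made up for illustration and is not part of the original answer.

```python
import re

# Hypothetical helper name, used only for illustration.
def is_rna(sequence):
    """Return True if sequence is non-empty and contains only A, U, G, C."""
    # fullmatch succeeds only if the pattern consumes the entire string,
    # so no explicit ^ or $ anchors are needed.
    return re.fullmatch(r"[AUGC]+", sequence) is not None

print(is_rna("AUGGAC"))    # True
print(is_rna("AUGGACX"))   # False
print(is_rna("AUGGAC\n"))  # False

# By contrast, $ tolerates a single trailing newline:
print(re.match(r"[AUGC]+$", "AUGGAC\n") is not None)  # True
```

If a trailing newline should be rejected and re.fullmatch is not an option, \Z can be used in place of $, since it matches only at the very end of the string.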

roippi

Use ^[AUCG]*$; this will match against the entire string.

Or, if there has to be at least one letter, ^[AUCG]+$. ^ and $ stand for beginning of string and end of string respectively; * and + stand for zero or more and one or more respectively.

This is purely about regular expressions and not specific to Python really.

Erik Kaplun

If those are the only characters allowed in the string, you can do the following:

>>> r = re.compile(r'^[AUGC]+$')
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
>>> 

Then, if you want your regex to match the empty string as well, you can do:

>>> r = re.compile(r'^[AUGC]*$')
>>> r.match("")
<_sre.SRE_Match object at 0x10ee16718>
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")

Here's a description of what the first regexp does:

[Figure: regular expression visualization]

zmo

You are actually really close. What you have just tests for a single character that is A, U, G or C.

What you want is to match a string of one or more letters that are all A, U, G or C. You can accomplish this by adding the + quantifier to your regular expression.

re.match(r"^[AUGC]+$", "AUGGAC")

Additionally, adding $ at the end anchors the match to the end of the string. You can optionally use ^ at the front to match the beginning of the string, although re.match already starts at the beginning.
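To see why the pattern from the question accepted "AUGGACX", it can help to inspect what the match object actually covers. This is a small illustrative sketch; the variable names are made up for the example.

```python
import re

# Without a quantifier or end anchor, the pattern matches one character
# at the start of the string and stops there.
m = re.match(r"[AUGC]", "AUGGACX")
matched_text = m.group()  # the text that was matched
matched_span = m.span()   # (start, end) offsets of the match

# With + and $, the whole string has to consist of the allowed letters.
strict_ok = re.match(r"^[AUGC]+$", "AUGGAC") is not None
strict_bad = re.match(r"^[AUGC]+$", "AUGGACX") is not None

print(matched_text, matched_span)  # A (0, 1)
print(strict_ok, strict_bad)       # True False
```

The first match succeeds because only the leading "A" is examined; the trailing "X" is never looked at.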

Hunter McMillen

Just check to see if there is anything other than "AUGC" in there:

if re.search('[^AUGC]', string_to_check):
    pass  # fail: the string contains a character other than A, U, G or C

You can add a check to make sure the string is not empty in the same statement:

if not string_to_check or re.search('[^AUGC]', string_to_check):
    pass  # fail: the string is empty or contains a disallowed character
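A side benefit of searching for a disallowed character is that the match object says where the problem is. Below is a rough sketch of that idea; the names INVALID, first_invalid and is_valid are hypothetical and chosen only for this example.

```python
import re

# Matches any single character that is not A, U, G or C.
INVALID = re.compile(r"[^AUGC]")

def first_invalid(sequence):
    """Return (index, character) of the first disallowed character, or None."""
    m = INVALID.search(sequence)
    if m is None:
        return None
    return m.start(), m.group()

def is_valid(sequence):
    # Reject the empty string as well as any string with a stray character.
    return bool(sequence) and first_invalid(sequence) is None

print(first_invalid("AUGGACX"))  # (6, 'X')
print(is_valid("AUGGAC"))        # True
print(is_valid(""))              # False
```

Returning the position makes it easier to produce a useful error message than a bare pass/fail check.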
Rob Watts

No real need to use a regex:

>>> good = 'AUGGCUA'
>>> bad = 'AUGHACUA'
>>> all(c in 'AUGC' for c in good)
True
>>> all(c in 'AUGC' for c in bad)
False
Blair
  • No real need *not* to use them, either. They are the perfect job for this task. – Konrad Rudolph May 13 '14 at 19:37
  • @KonradRudolph - sure they work. But, to me anyway, it's easier to see what is being checked in my code than having to parse the quantifiers, start and end of string markers etc. in the regex solutions posted here. Each to their own though :). – Blair May 13 '14 at 19:45

I know you're asking about regular expressions, but I thought it was worth mentioning set. To establish whether your string contains only A, U, G or C, you could do this:

>>> input = "AUCGCUAGCGAU"
>>> s = set("AUGC")
>>> set(input) <= s
True
>>> bad = "ASNMSA"
>>> set(bad) <= s
False

edit: thanks to @roippi for spotting my mistake, <= should be used, not ==.

Instead of using <=, the method issubset can be used:

>>> set("AUGAUG").issubset(s)
True

If all characters in the string input are in the set s, then issubset will return True.
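One thing to keep in mind with the subset test is that the empty string passes it, because the empty set is a subset of every set. The sketch below adds an explicit non-empty check and uses set difference to report which characters are wrong. The names ALLOWED, invalid_characters and is_valid are assumptions made for this example.

```python
ALLOWED = frozenset("AUGC")

def invalid_characters(sequence):
    """Return the set of characters in sequence that are not A, U, G or C."""
    return set(sequence) - ALLOWED

def is_valid(sequence):
    # set("") is empty, and the empty set is a subset of everything,
    # so the non-empty check has to be explicit.
    return bool(sequence) and set(sequence) <= ALLOWED

print(is_valid("AUCGCUAGCGAU"))              # True
print(is_valid(""))                          # False
print(set("") <= ALLOWED)                    # True
print(sorted(invalid_characters("ASNMSA")))  # ['M', 'N', 'S']
```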

Tom Fenech

From: https://docs.python.org/2/library/re.html

Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.

So you could search for [^AUGC]: if it matches anywhere in the string, reject the string; otherwise keep it.

Darren