I have the following pandas series:
import pandas as pd
arr = pd.Series(['C', 'A', 'T', 'G', 'CC', 'KEEP', 'ATC', 'CACACAC', 'CCCCCCCCACAGTTTATGTAG', 'C(2', 'Cor CC', 'AC or ACC'])
From it, I want to remove the elements 'C(2', 'Cor CC' and 'AC or ACC' using a regex.
So the criteria that I am trying to match are:
- Start with a capital letter: ^[A-Z]
- Exclude any element that has a parenthesis in it: [^\(]
- Exclude any element that contains the string 'or'
arr.str.contains(r'^[A-Z][\(]') will match 'C(2', whereas I can match 'Cor CC' and 'AC or ACC' with arr.str.contains(r'\w*or.\w*').
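For reference, here is a minimal sketch of what those two calls give me on the series above, and how the elements could be dropped by inverting the combined mask. The names has_paren, has_or and kept are only illustrative.

```python
import pandas as pd

arr = pd.Series(['C', 'A', 'T', 'G', 'CC', 'KEEP', 'ATC', 'CACACAC',
                 'CCCCCCCCACAGTTTATGTAG', 'C(2', 'Cor CC', 'AC or ACC'])

# Capital letter immediately followed by an opening parenthesis
has_paren = arr.str.contains(r'^[A-Z][\(]')

# The string "or" followed by at least one more character
has_or = arr.str.contains(r'\w*or.\w*')

# Invert the combined boolean mask to keep everything that matched neither pattern
kept = arr[~(has_paren | has_or)]
print(kept.tolist())
```

This works, but it needs two separate patterns plus boolean indexing rather than one expression describing what to keep.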
I can then pop these elements out of my list, but I am trying to keep the elements of interest (i.e. everything except 'C(2', 'Cor CC' and 'AC or ACC') using a regular expression.
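To make the goal concrete, this is roughly the shape of what I am after: a sketch that folds the three criteria into one pattern using negative lookaheads. The pattern and the name wanted are my own guess at an approach, not something I know to be the recommended way.

```python
import pandas as pd

arr = pd.Series(['C', 'A', 'T', 'G', 'CC', 'KEEP', 'ATC', 'CACACAC',
                 'CCCCCCCCACAGTTTATGTAG', 'C(2', 'Cor CC', 'AC or ACC'])

# (?!.*\()  -> reject the element if "(" appears anywhere in it
# (?!.*or)  -> reject the element if "or" appears anywhere in it
# [A-Z]     -> the first character must be a capital letter
pattern = r'(?!.*\()(?!.*or)[A-Z]'

# Series.str.match anchors the pattern at the start of each element,
# so no explicit ^ is needed here
wanted = arr[arr.str.match(pattern)]
print(wanted.tolist())
```

Note that this treats any lowercase 'or' as disqualifying, so a hypothetical element such as 'Cork' would also be dropped.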