I have the following pandas series:

import pandas as pd

arr = pd.Series(['C', 'A', 'T', 'G', 'CC', 'KEEP', 'ATC', 'CACACAC', 'CCCCCCCCACAGTTTATGTAG', 'C(2', 'Cor CC', 'AC or ACC'])

From it, I want to remove the elements C(2, Cor CC and AC or ACC using a regex.

So the criteria that I am trying to match are:

  1. Start with a capital letter: ^[A-Z]
  2. Exclude any element that has a parenthesis in it: [^\(]
  3. Exclude any element that contains the substring or

arr.str.contains(r'^[A-Z][\(]') will match C(2, whereas I can match Cor CC and AC or ACC with arr.str.contains(r'\w*or.\w*').

I could then drop these elements from the series, but I would rather select the elements of interest (i.e. everything except C(2, Cor CC and AC or ACC) directly with a regular expression.
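For reference, the two-step approach described above could be sketched as follows. This is only an illustration of combining the two patterns from the question; the names has_paren, has_or and kept are made up for the example.

```python
import pandas as pd

arr = pd.Series(['C', 'A', 'T', 'G', 'CC', 'KEEP', 'ATC', 'CACACAC',
                 'CCCCCCCCACAGTTTATGTAG', 'C(2', 'Cor CC', 'AC or ACC'])

# Boolean mask for an uppercase letter followed by "(" at the start
has_paren = arr.str.contains(r'^[A-Z][\(]')

# Boolean mask for elements containing "or" followed by any character
has_or = arr.str.contains(r'\w*or.\w*')

# Combine the masks with | and negate with ~ to keep everything else
kept = arr[~(has_paren | has_or)]
print(kept.tolist())
```

Boolean indexing with the negated mask keeps the original index labels, so there is no need to pop elements one by one.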

Mazdak
BCArg

1 Answer

You may use

arr[~arr.str.contains(r'^[A-Z]\(|or')]

Details

  • ^[A-Z]\( - an uppercase ASCII letter and ( at the start of a string
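  • | - alternation: match either the pattern on the left or the one on the right
  • or - the literal substring or, anywhere in the string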

See the regex demo.

Wiktor Stribiżew