Not sure if this question has been asked before, but I couldn't find it, so here it is:

randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
randomList2 = []
for i in randomList:
  if i <contains any characters other than "A","C","G", or "T">:
    <add a string without junk to randomList2>

How would I implement the parts in angle brackets (<>)? Thanks,

Pydronia

3 Answers

>>> randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
>>> import re
>>> [re.sub("[^ACGT]+", "", s) for s in randomList]
['ACGT', 'AG', 'AGCT']

`[^ACGT]+` matches a run of one or more (`+`) characters other than A, C, G, or T, so each run of junk is removed in a single substitution.

Some timings:

>>> import timeit
>>> setup = '''randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
... import re'''
>>> timeit.timeit(setup=setup, stmt='[re.sub("[^ACGT]+", "", s) for s in randomList]')
8.197133132976195
>>> timeit.timeit(setup=setup, stmt='[re.sub("[^ACGT]", "", s) for s in randomList]')
9.395620040786165

Without re, it's faster (see @cmd's answer):

>>> timeit.timeit(setup=setup, stmt="[''.join(c for c in s if c in 'ACGT') for s in randomList]")
6.874829817476666

Even faster (see @JonClements's comment; note that the two-argument `translate(None, ...)` call only works on Python 2 `str`):

>>> setup='''randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]\nascii_exclude = ''.join(set('ACGT').symmetric_difference(map(chr, range(256))))'''
>>> timeit.timeit(setup=setup, stmt="""[item.translate(None, ascii_exclude) for item in randomList]""")
2.814761871275735
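
For anyone on Python 3, where `str.translate` takes a single mapping argument, the same idea might look roughly like the sketch below. The names `random_list`, `exclude` and `delete_table` are mine, and like the original `ascii_exclude` it only covers the first 256 code points, so this assumes the junk never contains anything beyond that range.

```python
# Python 3 sketch of the translate approach (not the code that was timed above).
random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]

# Every character with a code point below 256 except A, C, G, T.
# Assumption: the input never contains characters at or above U+0100.
exclude = ''.join(chr(i) for i in range(256) if chr(i) not in 'ACGT')

# The third argument of str.maketrans lists characters to delete.
delete_table = str.maketrans('', '', exclude)

cleaned = [item.translate(delete_table) for item in random_list]
print(cleaned)  # ['ACGT', 'AG', 'AGCT']
```

I haven't timed this variant, so the numbers above don't carry over to it.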

Also possible:

>>> setup='randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]'
>>> timeit.timeit(setup=setup, stmt="[filter(set('ACGT').__contains__, item) for item in randomList]")
4.341086316883207
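
One caveat with the `filter` version: on Python 2, `filter` on a string returns a string, but on Python 3 it returns a lazy iterator, so the join it was meant to avoid comes back. A rough Python 3 sketch (the names `random_list`, `valid` and `cleaned` are just for illustration):

```python
# Python 3 sketch: filter() yields characters, so they have to be joined again.
random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]

valid = set('ACGT')  # build the set once instead of once per string

cleaned = [''.join(filter(valid.__contains__, item)) for item in random_list]
print(cleaned)  # ['ACGT', 'AG', 'AGCT']
```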
Tim Pietzcker
  • Don't think the `+` needs to be there... – Jon Clements Feb 24 '14 at 21:06
  • @JonClements: It speeds up the match because the characters don't have to be replaced one by one. Will add some timings. – Tim Pietzcker Feb 24 '14 at 21:07
  • While it does make sense, for simple char replacements I wouldn't have thought there'd be such a difference. Thanks for taking the time to post the `timeit`s. – Jon Clements Feb 24 '14 at 21:28
  • I would be curious to see how something such as: `ascii_exclude = ''.join(set('ACGT').symmetric_difference(map(chr, range(256)))); for item in randomList: print item.translate(None, ascii_exclude)` performs... – Jon Clements Feb 24 '14 at 21:33
  • Possibly also the rather nasty (but avoids a join)... `filter(set('ACGT').__contains__, the_string)` – Jon Clements Feb 24 '14 at 21:36
  • @JonClements: I added the timings. Excellent ideas! Well, I'm off to bed now. G'night. – Tim Pietzcker Feb 24 '14 at 21:44

`re` is overkill for this:

randomList2 = [''.join(c for c in s if c in 'ACGT') for s in randomList]

And if you don't want to include the strings that had no junk to begin with:

valid = set("ACGT")
randomList2 = [''.join(c for c in s if c in valid) for s in randomList if any(c2 not in valid for c2 in s)]
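
Putting the two variants side by side as a runnable sketch (the helper names `strip_junk` and `had_junk` are made up here for readability, they are not part of the answer above):

```python
VALID = frozenset("ACGT")  # the only characters we want to keep

def strip_junk(s):
    """Return s with every character outside VALID removed."""
    return ''.join(c for c in s if c in VALID)

def had_junk(s):
    """True if s contains at least one character outside VALID."""
    return any(c not in VALID for c in s)

random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]

# Variant 1: clean every string, keeping the ones that were already fine.
all_cleaned = [strip_junk(s) for s in random_list]

# Variant 2: only keep cleaned versions of strings that actually had junk.
only_fixed = [strip_junk(s) for s in random_list if had_junk(s)]

print(all_cleaned)  # ['ACGT', 'AG', 'AGCT']
print(only_fixed)   # ['AG', 'AGCT']
```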
cmd

You can use regular expressions:

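import re
randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
# Matches any single character that is not A, C, G or T
nonACGT = re.compile('[^ACGT]')
# Overwrite each entry in place with its cleaned-up version
for i in range(len(randomList)):
    randomList[i] = nonACGT.sub('', randomList[i])
print(randomList)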
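
If the point of the index loop is to update the existing list object rather than build a new one, slice assignment is a shorter way to get the same effect. This is only a sketch, and the names `random_list`, `alias` and `non_acgt` are illustrative:

```python
import re

random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]
alias = random_list  # a second reference to the same list object

non_acgt = re.compile('[^ACGT]')

# Slice assignment replaces the contents of the existing list,
# so every other reference to it sees the cleaned strings too.
random_list[:] = [non_acgt.sub('', s) for s in random_list]

print(alias)  # ['ACGT', 'AG', 'AGCT']
```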
Al Sweigart