Python - removing everything from a string except certain characters

Question

Not sure if this question has been asked before, but I couldn't find it, so here it is:

randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
randomList2 = []
for i in randomList:
  if i <contains any characters other than "A","C","G", or "T">:
    <add a string without junk to randomList2>

How would I do the things within <>? Thanks.

http://stackoverflow.com/a/10017169/2282538 – Tyler Feb 24 '14 at 21:04

Tim Pietzcker · Accepted Answer · 2014-02-24T21:40:59.200

4

>>> randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
>>> import re
>>> [re.sub("[^ACGT]+", "", s) for s in randomList]
['ACGT', 'AG', 'AGCT']

[^ACGT]+ matches a run of one or more (+) characters other than A, C, G, or T, so each run of junk is removed in a single replacement.

Some timings:

>>> import timeit
>>> setup = '''randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
... import re'''
>>> timeit.timeit(setup=setup, stmt='[re.sub("[^ACGT]+", "", s) for s in randomList]')
8.197133132976195
>>> timeit.timeit(setup=setup, stmt='[re.sub("[^ACGT]", "", s) for s in randomList]')
9.395620040786165
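
One way to see what the `+` changes, as a rough sketch rather than a profiling result: `re.subn` reports how many substitutions were made, so the two patterns can be compared on the same strings. The names `random_list` and `counts` are made up for this illustration, and the backslash in the sample string is escaped explicitly so the snippet runs cleanly on Python 3.

```python
import re

# Same sample strings as above; the backslash is written as an explicit escape.
random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]

counts = []
for s in random_list:
    # re.subn returns a tuple: (new_string, number_of_substitutions_made)
    with_plus, n_runs = re.subn("[^ACGT]+", "", s)
    without_plus, n_chars = re.subn("[^ACGT]", "", s)
    # Both patterns produce the same cleaned string; only the work differs.
    assert with_plus == without_plus
    counts.append((with_plus, n_runs, n_chars))

for cleaned, n_runs, n_chars in counts:
    print(f"{cleaned}: {n_runs} substitution(s) with '+', {n_chars} without")
```

With `+`, each run of junk counts as one substitution; without it, every junk character is replaced separately.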

Without re, it's faster (see @cmd's answer):

>>> timeit.timeit(setup=setup, stmt="[''.join(c for c in s if c in 'ACGT') for s in randomList]")
6.874829817476666

Even faster (see @JonClements' comment):

>>> setup='''randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]\nascii_exclude = ''.join(set('ACGT').symmetric_difference(map(chr, range(256))))'''
>>> timeit.timeit(setup=setup, stmt="""[item.translate(None, ascii_exclude) for item in randomList]""")
2.814761871275735
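
Note that `translate(None, ...)` with a string of characters to delete is the Python 2 `str` API. A possible Python 3 counterpart, offered only as a sketch: build a table with `str.maketrans`, whose third argument lists the characters to delete. Like the original, it only covers code points below 256. The names `random_list`, `delete_chars` and `table` are illustrative, and no timing is claimed for this variant.

```python
random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]

# Every character with a code point below 256 that is not A, C, G or T.
delete_chars = ''.join(chr(i) for i in range(256) if chr(i) not in 'ACGT')

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', delete_chars)

cleaned = [item.translate(table) for item in random_list]
print(cleaned)
```

Characters with code points of 256 and above are left untouched by this table, so it is only suitable when the input is known to stay within that range.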

Also possible:

>>> setup='randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]'
>>> timeit.timeit(setup=setup, stmt="[filter(set('ACGT').__contains__, item) for item in randomList]")
4.341086316883207
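
The `filter` version relies on Python 2, where filtering a `str` returns a `str`. In Python 3, `filter` returns a lazy iterator, so a `join` is needed after all. A minimal sketch of that adaptation (the names `random_list` and `valid` are illustrative):

```python
random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]

# Build the set of allowed characters once and reuse its membership test.
valid = set('ACGT')

# filter() yields the characters that pass the test; join them back into a string.
cleaned = [''.join(filter(valid.__contains__, item)) for item in random_list]
print(cleaned)
```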

edited Feb 24 '14 at 21:40

answered Feb 24 '14 at 21:05

Tim Pietzcker

Don't think the `+` needs to be there... – Jon Clements Feb 24 '14 at 21:06
@JonClements: It speeds up the match because the characters don't have to be replaced one by one. Will add some timings. – Tim Pietzcker Feb 24 '14 at 21:07
While it does make sense, for simple char replacements I wouldn't have thought there'd be such a difference. Thanks for taking the time to post the `timeit`s. – Jon Clements Feb 24 '14 at 21:28
I would be curious to see how something such as: `ascii_exclude = ''.join(set('ACGT').symmetric_difference(map(chr, range(256)))); for item in randomList: print item.translate(None, ascii_exclude)` performs... – Jon Clements Feb 24 '14 at 21:33
Possibly also the rather nasty (but avoids a join)... `filter(set('ACGT').__contains__, the_string)` – Jon Clements Feb 24 '14 at 21:36
@JonClements: I added the timings. Excellent ideas! Well, I'm off to bed now. G'night. – Tim Pietzcker Feb 24 '14 at 21:44

cmd · Answer 2 · 2014-02-24T21:22:21.087

4

re is overkill for this

randomList2 = [''.join(c for c in s if c in 'ACGT') for s in randomList]

And if you want to leave out the strings that had no junk to begin with:

valid = set("ACGT")
randomList2 = [''.join(c for c in s if c in valid) for s in randomList if any(c2 not in valid for c2 in s)]
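
The same filtering can be spelled out as a small function, which may be easier to read than the nested comprehension. This is only a sketch; `clean_dirty_strings` and its parameter names are hypothetical, and it uses `set.issuperset` in place of the `any(...)` test.

```python
random_list = ["ACGT", "A#$..G", "..,/\\]AGC]]]T"]
valid = set("ACGT")

def clean_dirty_strings(strings, allowed=valid):
    """Return cleaned copies of only those strings that contained junk."""
    result = []
    for s in strings:
        # issuperset(s) is False when s has a character outside the allowed set.
        if not allowed.issuperset(s):
            result.append(''.join(c for c in s if c in allowed))
    return result

random_list2 = clean_dirty_strings(random_list)
print(random_list2)
```

Here "ACGT" is skipped because it contains no junk, so only the two cleaned strings end up in `random_list2`.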

edited Feb 24 '14 at 21:22

answered Feb 24 '14 at 21:12

cmd

Good point, and very elegant. Also faster (see my edited answer). – Tim Pietzcker Feb 24 '14 at 21:26

score 0 · Answer 3 · answered Feb 24 '14 at 21:05

You can use regular expressions:

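import re
randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]
# Precompile a pattern matching any single character other than A, C, G, or T
nonACGT = re.compile('[^ACGT]')
for i in range(len(randomList)):
    # Overwrite each entry with its cleaned version
    randomList[i] = nonACGT.sub('', randomList[i])
print(randomList)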

answered Feb 24 '14 at 21:05

Al Sweigart
