From my searches I've seen that this sort of question has been asked multiple times, and I understand the canonical solutions. However, none address the specific issue I'm having. I'm writing a function in Python 3.4 that uses regular expressions to strip nonsense characters from strings (chromosome IDs from the many different types of bioinformatics files I work with). There are no general rules about which strange characters might be present, so the idea is to write it in such a way that new special cases can be added quickly. I've included a few examples in my code below.
Following the logic from several other posts:
How can I do multiple substitutions using regex in python?
Efficiently carry out multiple string replacements in Python
multiple regex substitution in multiple files using python
Multiple, specific, regex substitutions in Python
Python replace multiple strings
etc...
I have written the following:
def fix_chromosome_id(chromosome):
    replacements = OrderedDict([(r'lcl|', ''),
                                (r'gi|', ''),
                                (r'chromosome', ''),
                                (r'^chr', ''),
                                (r'_+', ''),
                                (r'\s+', ''),
                                (r'^\s', ''),
                                (r'\s$', ''),
                                (r'/', '_'),
                                (r'|$', ''),
                                (r'|', '_'),
                                (r'(', '_'),
                                (r')', '_'),
                                (r'_+', '_')])  # Ordered dictionary of regex and substitutions from list of tuples
    # Compile as regex objects, substitute regex as specified in the ordered dictionary
    pattern = re.compile('|'.join(re.escape(regex) for regex in replacements))
    chromosome = pattern.sub(lambda match: replacements[match.group(0)], chromosome, re.IGNORECASE)
You can see I've created an ordered dictionary from a list of tuples, since the order of the replacements can matter and a standard dictionary does not preserve it. The keys are then used as regular expressions, and each match is replaced with the corresponding value.
My problems:
- Case-insensitive substitution does not work: 'chromosome12' is replaced with '12', but 'CHROMOSOME12' is not, despite the re.IGNORECASE.
- Substituting at the beginning of a string does not work ('chr12' is not replaced by '12').
- Whitespace is not removed, even though patterns such as \s+ are included as raw strings.
None of the examples I have been able to find that use dictionary keys and values in this way look at the behavior of these sorts of special characters.
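The three symptoms can be reproduced in isolation. The following is a minimal sketch, assuming (this is my guess, not something the posts above confirm) that they come from two places: re.escape() turning the keys into literal text, and re.IGNORECASE landing in the count parameter of the compiled pattern's sub(). The variable names are only for illustration.

```python
import re

# re.escape() makes regex metacharacters literal, so the anchor and the
# whitespace class stop behaving as regex syntax.
escaped_anchor = re.escape(r'^chr')   # matches a literal caret followed by "chr"
escaped_space = re.escape(r'\s+')     # matches a literal backslash, "s" and "+"

anchor_hit = re.search(escaped_anchor, 'chr12')   # no literal caret in the input
space_hit = re.search(escaped_space, 'chr 12')    # no literal backslash in the input

# The signature of a compiled pattern's method is sub(repl, string, count=0),
# so a flag passed as the third positional argument is read as a count.
flag_as_count = int(re.IGNORECASE)

# Flags take effect when they are given to re.compile() instead.
case_sensitive = re.compile(r'chromosome').sub('', 'CHROMOSOME12')
case_insensitive = re.compile(r'chromosome', re.IGNORECASE).sub('', 'CHROMOSOME12')

print(anchor_hit, space_hit, flag_as_count, case_sensitive, case_insensitive)
```

If that reading is right, there would be a further catch even without re.escape(): match.group(0) is the matched text (for example 'chr'), not the pattern that produced it (r'^chr'), so it could not be used directly as a dictionary key.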
However, if I write something like:
if re.search(r'^0+$', chromosome):
    chromosome = 0
That works just fine for replacing a string of an arbitrary number of zeroes with one zero.
What, then, is the problem with the above code? I'd be grateful if you could take a look. I could write a separate re.sub() call for each specific instance, but surely there is a more efficient way to do this.
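For comparison, here is a rough sketch of the "one re.sub() per case" fallback written as a loop over an ordered list of tuples, so that adding a special case is still a one-line change. The rule list is a trimmed and reordered subset of mine (whitespace is stripped first, and the literal |, ( and ) are escaped by hand), and the names RULES, COMPILED and fix_chromosome_id_sequential are hypothetical. I am not claiming this is the efficient solution I am after, only that it shows the behavior I expect.

```python
import re

# Hypothetical rule list: (pattern, replacement) pairs applied in order.
# Literal pipes and parentheses are escaped by hand instead of via re.escape().
RULES = [
    (r'\s+', ''),         # remove all whitespace
    (r'lcl\|', ''),       # drop the literal "lcl|" prefix
    (r'gi\|', ''),        # drop the literal "gi|" prefix
    (r'chromosome', ''),
    (r'^chr', ''),        # only at the start of the string
    (r'/', '_'),
    (r'\|$', ''),         # trailing pipe
    (r'[|()]', '_'),      # remaining pipes and parentheses
    (r'_+', '_'),         # collapse runs of underscores
]

# Compile once, passing the flag to re.compile() so it is treated as a flag.
COMPILED = [(re.compile(pattern, re.IGNORECASE), repl) for pattern, repl in RULES]


def fix_chromosome_id_sequential(chromosome):
    """Apply every rule in order and return the cleaned ID."""
    for pattern, repl in COMPILED:
        chromosome = pattern.sub(repl, chromosome)
    return chromosome


print(fix_chromosome_id_sequential('CHROMOSOME12'))  # 12
print(fix_chromosome_id_sequential(' Chr 12 '))      # 12
```

Each rule sees the output of the previous one, so the order of the list matters in the same way the OrderedDict was meant to.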