
From my searches I can see that this sort of question has been asked many times, and I understand the canonical solutions. However, none of them addresses the specific issue I'm having. I'm trying to write a function that strips nonsense characters from a string using regular expressions in Python 3.4 (the strings are chromosome IDs from any number of different types of bioinformatics files I'm working with). There are no general rules about what sorts of strange characters might be present, so the idea is to write the function in such a way that new special cases can be added quickly. I've included a few examples in my code below.

Following the logic from several other posts:

How can I do multiple substitutions using regex in python?

Efficiently carry out multiple string replacements in Python

multiple regex substitution in multiple files using python

Multiple, specific, regex substitutions in Python

Python replace multiple strings

etc...

I have written the following:

def fix_chromosome_id(chromosome):
    replacements = OrderedDict([(r'lcl|', ''),
                                (r'gi|', ''),
                                (r'chromosome', ''),
                                (r'^chr', ''),
                                (r'_+', ''),
                                (r'\s+', ''),
                                (r'^\s', ''),
                                (r'\s$', ''),
                                (r'/', '_'),
                                (r'|$', ''),
                                (r'|', '_'),
                                (r'(', '_'),
                                (r')', '_'),
                                (r'_+', '_')])  # Ordered dictionary of regex and substitutions from list of tuples

    # Compile as regex objects, substitute regex as specified in the ordered dictionary
    pattern = re.compile('|'.join(re.escape(regex) for regex in replacements))
    chromosome = pattern.sub(lambda match: replacements[match.group(0)], chromosome, re.IGNORECASE)

You can see I've created an ordered dictionary from a list of tuples, since the order of the replacements can matter and a standard dictionary would not preserve it. I then use the keys as regular expressions and attempt to replace each match with the corresponding value.

My problems:

  1. Substituting case insensitively does not work ('chromosome12' but not 'CHROMOSOME12' is replaced with '12' despite the re.IGNORECASE)
  2. Substituting the beginning of a string does not work ('chr12' is not replaced by '12').
  3. Whitespace characters are not removed, even though the \s patterns are written as raw strings.

None of the examples I have been able to find that use dictionary keys and values in this way look at how these sorts of special characters behave.

However, if I write something like:

if re.search(r'^0+$', chromosome):
        chromosome = 0

That works just fine for replacing a string of an arbitrary number of zeroes with one zero.

What then is the problem with the above code? If you'd be so kind as to take a look. I could type re.sub() for each specific instance, but surely there is a more efficient way to do this.

Jeff

2 Answers


The plain list of 2-tuples that you're using to build the OrderedDict is probably a better data structure for this, since there is not a true key/value relationship between a pattern and its replacement. Also, you have a duplicated key, and this will appear only once in a dictionary! Leaving it as a list will use less memory, to boot (probably not a major factor though).
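To make the duplicated-key point concrete, here is a minimal sketch. The three patterns are only a subset picked from the question for illustration, and the variable names are made up. An OrderedDict keeps one entry per key, so the second `_+` overwrites the value of the first while keeping its original position:

```python
from collections import OrderedDict

# Illustrative subset of the question's patterns; '_+' appears twice.
pairs = [(r'_+', ''), (r'\s+', ''), (r'_+', '_')]

as_dict = OrderedDict(pairs)

# The list keeps all three steps; the dictionary silently keeps only two.
print(len(pairs), len(as_dict))

# The later value replaces the earlier one, but the key stays in its first position.
print(list(as_dict.items()))
```

With a plain list, both `_+` steps survive and run in the order written.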

The main problem I see is that you're escaping your patterns programmatically. Therefore the special characters in your patterns don't have their special meanings. For example, + is changed to \+ by re.escape() which means it now matches a literal plus sign, not "one or more of the preceding character." This doesn't explain some of your problems (e.g. the case-insensitivity not working) but you're going to be very confused about everything until you fix this issue.
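A small sketch of what the escaping does, using two of the patterns from the question (the sample strings and variable names are invented for illustration). It also seems to account for the `^chr` problem, since the anchor gets escaped too:

```python
import re

# re.escape() backslash-escapes regex metacharacters so they match literally.
escaped_plus = re.escape(r'_+')
escaped_anchor = re.escape(r'^chr')

# Escaped: '+' is a literal plus sign, so a run of underscores no longer matches.
no_match = re.search(escaped_plus, 'chr__12')
# Unescaped: '+' means "one or more", so the run of underscores matches.
match = re.search(r'_+', 'chr__12')

# Escaped: '^' is a literal caret, so 'chr12' is left untouched.
unchanged = re.sub(escaped_anchor, '', 'chr12')
# Unescaped: '^' anchors at the start of the string, so 'chr' is removed.
stripped = re.sub(r'^chr', '', 'chr12')

print(escaped_plus, escaped_anchor, no_match, match.group(0), unchanged, stripped)
```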

What you should probably do instead is escape the stuff that needs escaping in the original patterns (for example, I assume the | character in a pattern such as lcl| is intended to match a literal |, so should be written \|) and don't use re.escape().
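As a rough illustration of the difference, assuming an ID such as 'lcl|chr1' (a made-up sample): an unescaped `|` is the alternation operator, so `lcl|` means "lcl or the empty string" and the bar itself is never matched.

```python
import re

chrom = 'lcl|chr1'  # hypothetical sample ID

# Unescaped '|' is alternation: 'lcl' OR the empty string, so the bar survives.
unescaped = re.sub(r'lcl|', '', chrom)

# Escaped '\|' matches a literal vertical bar, so the whole prefix is removed.
escaped = re.sub(r'lcl\|', '', chrom)

print(repr(unescaped), repr(escaped))
```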

Also, since you are not doing anything fancy with the replacement, you can just use re.sub() with the replacement text right in the call, rather than writing a lambda that does the same thing.

kindall
  • One of those times when a small suggestion turns out to be extremely valuable! I was making this more complicated, less readable, and therefore less "pythonic" than I needed to with the ordered dictionary. I still don't quite understand exactly what all the issues were, but they are now fixed by doing away with that and the programmatic escaping. Many thanks. – Jeff Dec 04 '15 at 07:14

On the advice of kindall I simplified things somewhat. Lambdas are sometimes convenient, but in this case not necessary and it makes things less readable. The ordered dictionary was a nice idea, but unnecessary.

Solution:

import re

def fix_chromosome_id(chromosome):
    # (pattern, replacement) pairs, applied in the order listed.
    # Raw strings keep backslashes such as \| and \s intact for the regex engine.
    replacements = [(r'lcl\|', ''),
                    (r'gi\|', ''),
                    (r'chromosome', ''),
                    (r'^chr', ''),
                    (r'^_+', ''),
                    (r'\s+', ''),
                    (r'^\s', ''),
                    (r'\s$', ''),
                    (r'/', '_'),
                    (r'\|$', ''),
                    (r'\|', '_'),
                    (r'\(', '_'),
                    (r'\)', '_'),
                    (r'_+', '')]

    # Compile each pattern case-insensitively and apply the substitutions in turn
    for pattern, rep in replacements:
        regex_pattern = re.compile(pattern, re.IGNORECASE)
        chromosome = regex_pattern.sub(rep, chromosome)
    return chromosome

Not sure why the re.IGNORECASE wasn't working before but everything is fine now.
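A likely explanation, going by the documented signature `Pattern.sub(repl, string, count=0)`: in the original code the flag was passed as the third positional argument of the compiled pattern's `sub()`, where it is read as `count` (the integer value of `re.IGNORECASE` is 2) rather than as a flag. The sketch below uses made-up sample strings to show the effect:

```python
import re

pattern = re.compile('chromosome')  # compiled WITHOUT re.IGNORECASE

# The third positional argument of Pattern.sub() is count, so the flag acts as count=2.
as_count = pattern.sub('', 'CHROMOSOME12', re.IGNORECASE)

# Passing the flag to re.compile() is what makes the matching case-insensitive.
as_flag = re.compile('chromosome', re.IGNORECASE).sub('', 'CHROMOSOME12')

# The count interpretation is visible when more than two matches exist.
limited = re.compile('a').sub('', 'aaaa', re.IGNORECASE)

print(as_count, as_flag, limited)
```

Flags belong in `re.compile()`, or in the `flags=` keyword of the module-level `re.sub()`.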

Jeff