Remove characters from the end of each element in a list of strings based on another list of strings (e.g. blacklist strings)

Question

I have a dictionary which contains a number of unique string values for a key "sample". I'm converting this key into a list for plotting, but I also want to generate a second list, with the same number of elements, in which certain strings are stripped from the end of each element. This "clean" list can then be used to group certain samples together for plotting. For example, my blacklist looks like:

blacklist = ['_001', '_002', '_003', '_004', '_005', '_006', '_007', '_008', '_009',
             '_01', '_02', '_03', '_04', '_05', '_06', '_07', '_08', '_09',
             '_1', '_2', '_3', '_4', '_5', '_6', '_7', '_8', '_9']

which I want to remove from each item in this example list generated from my dictionary:

sample = [(d['sample']) for d in my_stats]
sample
['sample_A', 'sample_A_001', 'sample_A_002', 'my_long_sample_B_1', 'other_sample_C_08', 'sample_A_03', 'sample1_D_07']

with the desired result of a new list:

sample
['sample_A', 'sample_A', 'sample_A', 'my_long_sample_B', 'other_sample_C', 'sample_A', 'sample1_D']

For context, I understand that some elements will then be identical: I want to use this list, together with equal-length lists generated from other keys of this dictionary, to compile a dataframe in which it serves as an id for plotting (i.e. so that I can group/color all of those values the same). Note that the number of underscores varies, and some elements in my list of strings contain no value from the blacklist (which is why I can't use some variant of split on the last underscore, for example).

This is similar to this issue: How can I remove multiple characters in a list?

but I don't want it to be so generalized/greedy and would ideally like to remove it from only the end as the user may have an input file with parts of these strings (e.g. the 1 in sample1_D) internally. I don't necessarily need to use a blacklist if there's another solution, it just seemed like that might be the easiest way.
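If a blacklist turns out not to be required, one possible alternative is a single anchored regular expression. The sketch below assumes that any trailing underscore followed by one to three digits should be stripped. That is slightly broader than the blacklist above (it would also remove suffixes such as _10 or _000), so it only fits if those never occur as meaningful endings. strip_numeric_suffix is a hypothetical helper name.

```python
import re

# Assumption: a suffix to remove is an underscore plus 1-3 digits at the very end.
SUFFIX_RE = re.compile(r'_\d{1,3}$')


def strip_numeric_suffix(name):
    """Return name without a trailing _N, _NN or _NNN suffix (hypothetical helper)."""
    return SUFFIX_RE.sub('', name)


sample = ['sample_A', 'sample_A_001', 'sample_A_002', 'my_long_sample_B_1',
          'other_sample_C_08', 'sample_A_03', 'sample1_D_07']

clean = [strip_numeric_suffix(s) for s in sample]
print(clean)
```

Because the pattern ends in `$`, digits inside a name (such as the 1 in sample1_D) are left alone.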

score 3 · Answer 1 · answered Oct 14 '19 at 21:14

Use regex.

import re

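# Group the alternatives so that '$' anchors every one of them, not just the last
pattern = '(?:' + '|'.join(map(re.escape, blacklist)) + ')$'
[re.sub(pattern, '', x) for x in sample]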

Output:

['sample_A',
 'sample_A',
 'sample_A',
 'my_long_sample_B',
 'other_sample_C',
 'sample_A',
 'sample1_D']
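One detail worth checking with this kind of pattern is how `$` interacts with `|`. Alternation has the lowest precedence in a regular expression, so a `$` appended to an ungrouped alternation anchors only the last alternative. The sketch below uses a shortened blacklist and a hypothetical sample name to show the difference.

```python
import re

blacklist = ['_001', '_01', '_1']  # shortened list, for illustration only

# Without a group, '$' attaches only to the last alternative ('_1').
loose = re.compile('|'.join(blacklist) + '$')
# With a non-capturing group, '$' applies to every alternative.
strict = re.compile('(?:' + '|'.join(map(re.escape, blacklist)) + ')$')

name = 'run_01_sample_B'  # hypothetical name with a blacklist string in the middle

print(loose.sub('', name))   # run_sample_B    (the internal '_01' is removed)
print(strict.sub('', name))  # run_01_sample_B (nothing at the end matches)
```

re.escape is not strictly needed for this blacklist, but it keeps the pattern safe if an entry ever contains a character such as `.` or `+`.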

CypherX

score 1 · Accepted Answer · answered Oct 14 '19 at 21:13

Here you go, see if this fits your requirements.

Basically, you split each string on the '_' character and test whether the last piece (with the underscore put back in front) is in your blacklist. If it is, drop it; if not, join the pieces back together. The results are collected in a new list.

blacklist = ['_001', '_002', '_003', '_004', '_005', '_006', '_007', '_008', '_009',
             '_01', '_02', '_03', '_04', '_05', '_06', '_07', '_08', '_09',
             '_1', '_2', '_3', '_4', '_5', '_6', '_7', '_8', '_9']
sample = ['sample_A', 'sample_A_001', 'sample_A_002', 'my_long_sample_B_1',
          'other_sample_C_08', 'sample_A_03', 'sample1_D_07']
results = []

for i in sample:
    splt = i.split('_')
    # Drop the last piece only if '_' + last piece is a blacklist entry
    value = '_'.join(splt[:-1]) if '_' + splt[-1] in blacklist else i
    results.append(value)

print(results)

Output:

['sample_A', 'sample_A', 'sample_A', 'my_long_sample_B', 'other_sample_C', 'sample_A', 'sample1_D']
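The same split-and-test idea can also be sketched with str.rpartition, which splits only at the last underscore and so avoids re-joining the pieces. This is a variation for illustration, not the answer's own code. strip_suffix is a hypothetical helper name, and the blacklist is held in a set so that each membership test is a hash lookup.

```python
# The same 27 suffixes as the blacklist: '_1'..'_9', '_01'..'_09', '_001'..'_009'
blacklist = {'_' + str(n).zfill(width) for width in (1, 2, 3) for n in range(1, 10)}


def strip_suffix(name, suffixes):
    """Drop the last '_'-separated piece of name if it is a listed suffix."""
    head, sep, tail = name.rpartition('_')
    # sep is '' when name contains no underscore at all
    if sep and sep + tail in suffixes:
        return head
    return name


sample = ['sample_A', 'sample_A_001', 'sample_A_002', 'my_long_sample_B_1',
          'other_sample_C_08', 'sample_A_03', 'sample1_D_07']

results = [strip_suffix(s, blacklist) for s in sample]
print(results)
```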

score 1 · Answer 3 · answered Oct 14 '19 at 21:16

You could use re.sub with a replacement function:

import re
from functools import partial

blacklist = ['_001', '_002', '_003', '_004', '_005', '_006', '_007', '_008', '_009',
             '_01', '_02', '_03', '_04', '_05', '_06', '_07', '_08', '_09',
             '_1', '_2', '_3', '_4', '_5', '_6', '_7', '_8', '_9']


def sub(match, bl=None):
    if match.group() in bl:
        return ""
    return match.group()


repl = partial(sub, bl=set(blacklist))

sample = ['sample_A', 'sample_A_001', 'sample_A_002', 'my_long_sample_B_1', 'other_sample_C_08', 'sample_A_03',
          'sample1_D_07']

print([re.sub("_[^_]+?$", repl, si) for si in sample])

Output

['sample_A', 'sample_A', 'sample_A', 'my_long_sample_B', 'other_sample_C', 'sample_A', 'sample1_D']

See why this is the way to go, if you want speed, here.

score 1 · Answer 4 · answered Oct 14 '19 at 21:17

Loop through your sample list. If the last character of an element is a digit, loop through the blacklist and check whether the element ends with one of its items. If it does, cut that item off the end of the string and assign the result back to the sample list.

blacklist = [
    '_001', '_002', '_003', '_004', '_005', '_006', '_007', '_008', '_009',
    '_01', '_02', '_03', '_04', '_05', '_06', '_07', '_08', '_09',
    '_1', '_2', '_3', '_4', '_5', '_6', '_7', '_8', '_9'
]

sample = ['sample_A', 'sample_A_001', 'sample_A_002', 'my_long_sample_B_1', 'other_sample_C_08', 'sample_A_03', 'sample1_D_07']

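for index, item in enumerate(sample):
    # Check whether the last char is a digit; if it's not, the item can't end with
    # a blacklist entry, so there is no point checking
    if item[-1].isdigit():
        for black in blacklist:
            if item.endswith(black):
                # Slice the suffix off; rstrip(black) would strip any trailing
                # '_' or digit characters found in black, not just this suffix
                sample[index] = item[:-len(black)]
                break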

print(sample)

Output:

['sample_A', 'sample_A', 'sample_A', 'my_long_sample_B', 'other_sample_C', 'sample_A', 'sample1_D']
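A caveat when trimming suffixes this way: str.rstrip treats its argument as a set of characters, not as a suffix, so it keeps removing trailing characters for as long as they appear in that set. The sketch below uses a hypothetical sample name to show the difference from slicing. remove_suffix is an illustrative helper; Python 3.9+ provides str.removesuffix for the same purpose.

```python
name = 'run_7_07'  # hypothetical sample name
suffix = '_07'

# rstrip removes every trailing '_', '0' or '7', not just the suffix itself
print(name.rstrip(suffix))  # run


def remove_suffix(text, suffix):
    """Remove suffix from the end of text exactly once, if present."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


print(remove_suffix(name, suffix))  # run_7
```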
