How do I split string into characters and replace characters with float values to find the sum of original string in Python?

Question

Hi,

Newbie to Python here.

I have >10,000 strings that represent peptide sequences. Each letter in the string is an amino acid and I would like to calculate the "net sum" of the string after I have replaced each letter with a pre-defined float value (ranging from -1 to -2).

I am stuck on where to start with the loop to make this work. I have the code to clean the strings so that non-alphabetical characters are removed, and the remaining letters are to be replaced with float values defined in a dictionary (e.g. W: 2.10, G: -1.0).

cleaned peptides, truncated to 5 characters

I imagine the code is something like.

I have 6 dataframes to repeat this process in.

Any help would be immensely appreciated!

Updated Code (THIS WORKS THANKS TO SARAH MESSER)

def hydrophobicity_score(peptide):
    hydro = { 
        'A': -0.5,
        'C': -1.0,
        'D': 3.0,
        'E': 3.0,
        'F': -2.5,
        'G': 0.0,
        'H': -0.5,
        'I': -1.8,
        'K': 3.0,
        'L': -1.8,
        'M': -1.3,
        'N': 0.2,
        'P': 0.0,
        'Q': 0.2,
        'R': 3.0,
        'S': 0.3,
        'T': -0.4,
        'V': -1.5,
        'W': -3.4,
        'Y': -2.3,
    }
    hydro_score = [hydro.get(aa, 0.0) for aa in peptide]
    return sum(hydro_score)

og_pep['Hydro'] = og_pep['Peptide'].apply(hydrophobicity_score)
og_pep

When you convert to the float values, are the values in a list? Are they of type `float`? — Blake G, Dec 10 '20 at 17:16
Can you provide a snippet (3-5 elements) of the input and what the corresponding desired output is? Maybe also the names and dtypes of your dataframe's columns - or at least the ones involved in this transformation? — Sarah Messer, Dec 10 '20 at 17:16
Why doesn't your loop use the `peptide` variable? However, it's not normally needed to write a loop to process all rows in a Pandas series. — Barmar, Dec 10 '20 at 17:17
Do you have a dictionary of the correlation of letter to floats? — Blake G, Dec 10 '20 at 17:17
You should have a function that takes a single truncated peptide sequence and sums that single sequence before you do it for all rows. Do you have that? — Blake G, Dec 10 '20 at 17:22
`sum(d[char] for char in peptide)` will calculate the sum of all the mappings in the dictionary `d`. — Barmar, Dec 10 '20 at 17:23
You don't even need to remove the non-alphas: `sum(d.get(char, 0) for char in peptide)` — Barmar, Dec 10 '20 at 17:24
Good. If you place that into a function, you can apply it to all rows in a DataFrame to create a new column of sums. — Blake G, Dec 10 '20 at 17:25
see: https://stackoverflow.com/questions/34962104/how-can-i-use-the-apply-function-for-a-single-column — Blake G, Dec 10 '20 at 17:38

Sarah Messer · Accepted Answer · 2020-12-10T19:16:02.400

Okay, first up, you don't want to write your own loop over the rows in a dataframe. Dataframes are designed for column-wise operations: you define what happens to a single value and let pandas apply it to the whole column. Getting your head around that is a bit of a stretch, but once you've defined a few row-level operations and applied them to large dataframes, it'll get smoother. (The problem with looping over rows by hand is mostly one of speed. It's sometimes useful in debugging or toy problems, but pandas' built-in column operations are generally faster and more concise than handling rows individually in an explicit loop.)

To do the conversion, you're going to need to define a custom function to operate on each individual row. Then you pass that custom function to the dataframe and tell it to apply that row-level function to one column in order to generate a new column.

So here's a possible function to get you started:

def peptide_score(peptide_string):
    '''Returns a numerical score given a sequence of peptide characters.'''
    # Replace the values in this dict (dictionary / map) with whatever values you need
    amino_acid_scores = { 
        'A': 0.1,
        'C': 1.4,
        'G': 0.32342,
        'T': -0.23,
        'U': 74.22
    }
    # This is called a "list comprehension." It's great for transforming sequences.
    score_list = [amino_acid_scores[character] for character in peptide_string]
    return sum(score_list)

# I'm assuming your pre-existing dataframe is called "gluc_dataframe" and that the
# column with your strings is called "Peptide".  Output scores will be stored in a new
# column, "score". Replace those names with whatever fits.
gluc_dataframe['score'] = gluc_dataframe['Peptide'].apply(peptide_score)

If you've got a lot of characters you want to ignore (whitespace, punctuation, whatever), you can replace amino_acid_scores[character] in the list comprehension with amino_acid_scores.get(character, 0.0).
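A minimal sketch of that difference, using made-up scores (the name `toy_scores` and its values are illustrative assumptions, not real amino-acid data): indexing with `[]` raises a `KeyError` on any character that is not in the dict, while `.get(character, 0.0)` simply counts it as zero.

```python
# Illustrative scores only; replace with your real values.
toy_scores = {'A': 0.5, 'C': 1.0, 'G': -0.25}

def strict_score(peptide_string):
    """Sum the scores; raises KeyError on any character missing from toy_scores."""
    return sum(toy_scores[character] for character in peptide_string)

def lenient_score(peptide_string):
    """Sum the scores; characters missing from toy_scores count as 0.0."""
    return sum(toy_scores.get(character, 0.0) for character in peptide_string)

# The '.' and '-' are skipped by the lenient version.
print(lenient_score("A.C-G"))

# The strict version fails on the first unknown character.
try:
    strict_score("A.C-G")
except KeyError as err:
    print("unknown character:", err)
```

Which one you want depends on the data: the strict version is handy for catching unexpected characters early, the lenient one for strings you have not cleaned.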

Beautiful. Thank you so much. This was how I envisioned @Barmar 's answer. I will post the result when completed. — thejahcoop, Dec 10 '20 at 17:42
this is actually great because I can do this before the data is split into 6 dataframes. — thejahcoop, Dec 10 '20 at 17:44
I have not had luck with this answer. I will post update in separate answer. — thejahcoop, Dec 10 '20 at 18:51
I missed the part about ignoring (whatever). It was trying to list comprehend the punctuation. — thejahcoop, Dec 10 '20 at 19:59
Please accept this answer and upvote if it worked as advertised. — Sarah Messer, Dec 10 '20 at 20:19

score 0 · Answer 2 · answered Dec 10 '20 at 18:53

def hydrophobicity_score(peptide):
    hydro = {
        'A': -0.5,
        'C': -1.0,
        'D': 3.0,
        'E': 3.0,
        'F': -2.5,
        'G': 0.0,
        'H': -0.5,
        'I': -1.8,
        'K': 3.0,
        'L': -1.8,
        'M': -1.3,
        'N': 0.2,
        'P': 0.0,
        'Q': 0.2,
        'R': 3.0,
        'S': 0.3,
        'T': -0.4,
        'V': -1.5,
        'W': -3.4,
        'Y': -2.3,
    }
    hydro_score = [hydro[aa] for aa in peptide]
    return sum(hydro_score)

og_peptide = og_pep['Peptide']
og_peptide = og_peptide.str.replace(r'\W+', '', regex=True)
og_peptide = og_peptide.str.replace(r'\d+', '', regex=True)
og_peptide = pd.DataFrame(og_peptide)
og_peptide['Hydro_Score'] = og_peptide.apply(hydrophobicity_score)
og_peptide

I am not getting the expected output.

Output

Here is og_pep DataFrame

You're applying the function to a dataframe, rather than to a series. Thus your list comprehension loops over the values in the DF, rather than the characters in a single string. — Sarah Messer, Dec 10 '20 at 19:12
Got it, thanks. I have updated the code in the original post to show the correct answer. — thejahcoop, Dec 10 '20 at 19:56
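A rough pure-Python analogy for what went wrong, assuming a plain list stands in for a dataframe column (`toy_hydro`, `toy_score`, and `column` are hypothetical names with made-up values). `Series.apply` hands the function one string at a time, whereas `DataFrame.apply` hands it a whole column by default, so the comprehension iterates over entire peptide strings instead of single characters.

```python
# Made-up values for illustration only.
toy_hydro = {'A': -0.5, 'C': -1.0, 'G': 0.0}

def toy_score(peptide):
    """Sum the per-character scores of whatever iterable is passed in."""
    return sum(toy_hydro[aa] for aa in peptide)

# A plain list standing in for the 'Peptide' column.
column = ["AC", "GA"]

# Series-style: the function sees one string per call.
per_row = [toy_score(peptide) for peptide in column]
print(per_row)

# DataFrame-style: the function sees the whole column at once, so the
# lookup is attempted with the full string "AC" rather than with 'A'.
try:
    toy_score(column)
except KeyError as err:
    print("looked up a whole string:", err)
```

This is only an analogy for the iteration behaviour; the exact error or output with a real dataframe may differ.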
