Pandas create columns from dictionary-returning function applied to a column

Question

Problem

I have a function that takes a str as input and returns a dict. I would like to apply this function to one column of every row of a pandas DataFrame and create new columns from the keys and values of the returned dictionary.

Function

example = "TGGCCCGCGAACTTGCCCGAAGCCCTCGTTCCCTGTCGGCTCTAACCGCTGGTGTAGTGG[CG]GAGCACGCGAACTTAGCAAGGGCTAAGCGATCAGGAATAAGAACAGCAGGAAAGCCAGAG"

def freqcount(s):
    bases = "".join(s.split("[CG]"))
    total = len(bases)
    outdic = {}
    for b1 in ["A", "G", "C", "T"]:
        outdic[b1] = bases.count(b1)/total
        for b2 in ["A", "G", "C", "T"]:
            outdic[b1+b2] = bases.count(b1+b2)/total
    return outdic

print(freqcount(example))

{'A': 0.25833333333333336, 'AA': 0.08333333333333333, 'AG': 0.10833333333333334, 'AC': 0.041666666666666664, 'AT': 0.016666666666666666, 'G': 0.30833333333333335, 'GA': 0.075, 'GG': 0.058333333333333334, 'GC': 0.10833333333333334, 'GT': 0.041666666666666664, 'C': 0.275, 'CA': 0.05, 'CG': 0.075, 'CC': 0.05, 'CT': 0.06666666666666667, 'T': 0.15833333333333333, 'TA': 0.041666666666666664, 'TG': 0.05, 'TC': 0.041666666666666664, 'TT': 0.025}

Dataframe

print(df_dna)

                                                  Forward_Sequence
cg00050873       TATCTCTGTCTGGCGAGGAGGCAACGCACAACTGTGGTGGTTTTTG...
cg00212031       CCATTGGCCCGCCCCAGTTGGCCGCAGGGACTGAGCAAGTTATGCG...
cg00213748       TCTGTGGGACCATTTTAACGCCTGGCACCGTTTTAACGATGGAGGT...
cg00214611       GCGCCGGCAGGACTAGCTTCCGGGCCGCGCTTTGTGTGCTGGGCTG...
cg00455876       CGCGTGTGCCTGGACTCTGAGCTACCCGGCACAAGCTCCAAGGGCT...
...                                                            ...
ch.22.909671F    TTTTCCTTTTAGCTGCTGATAGATTAATAGTATGTGAACCTTTTAA...
ch.22.46830341F  TGTGCATACATGCGCATGTGAACAGTCCATGGAGCTTAATCCCCTG...
ch.22.1008279F   CTGGCAGGGCACACACCTCAGCTGGGCCCTGTGGCAGGTGAACCCC...
ch.22.47579720R  ATGTACCCATACGGGAAAGGCCGCGTGAAGATGGAGACAGAGATGG...
ch.22.48274842R  AGTGTAGAATTTGGGGCTCGCCCTGTTGGTTCCTCCGGTGTGAAGG...

[485512 rows x 1 columns]

Expected output

I would like to have new columns, A, AA, AG, ..., and have the dictionary values in the correct column for each row.

Output I get so far

However, this is what I get:

print(df_dna.applymap(freqcount))

                                             Forward_Sequence
cg00050873  {'A': 0.21666666666666667, 'AA': 0.04166666666...
cg00212031  {'A': 0.21666666666666667, 'AA': 0.04166666666...
cg00213748  {'A': 0.18333333333333332, 'AA': 0.01666666666...
cg00214611  {'A': 0.14166666666666666, 'AA': 0.025, 'AG': ...
cg00455876  {'A': 0.15, 'AA': 0.025, 'AG': 0.0833333333333...
cg01707559  {'A': 0.10833333333333334, 'AA': 0.01666666666...
cg02004872  {'A': 0.13333333333333333, 'AA': 0.0, 'AG': 0....
cg02011394  {'A': 0.175, 'AA': 0.016666666666666666, 'AG':...
cg02050847  {'A': 0.175, 'AA': 0.025, 'AG': 0.05, 'AC': 0....
cg02233190  {'A': 0.225, 'AA': 0.03333333333333333, 'AG': ...

I get the same result with:

print(df_dna.apply(lambda row: freqcount(row["Forward_Sequence"]), axis=1))

Does anyone have an idea how I can achieve the desired result?

score 1 · Accepted Answer · answered Oct 13 '21 at 13:55

I actually just found the answer: pass the result_type argument to DataFrame.apply (it only takes effect with axis=1):

df_dna.apply(lambda row: freqcount(row["Forward_Sequence"]), axis=1, result_type="expand")

                   A        AA        AG        AC        AT         G \
cg00050873  0.216667  0.041667  0.091667  0.058333  0.008333  0.400000   
cg00212031  0.216667  0.041667  0.100000  0.050000  0.016667  0.391667   
cg00213748  0.183333  0.016667  0.075000  0.050000  0.041667  0.416667   
cg00214611  0.141667  0.025000  0.091667  0.016667  0.000000  0.400000   
cg00455876  0.150000  0.025000  0.083333  0.033333  0.008333  0.425000   
cg01707559  0.108333  0.016667  0.058333  0.008333  0.016667  0.291667   
cg02004872  0.133333  0.000000  0.075000  0.025000  0.033333  0.325000   
cg02011394  0.175000  0.016667  0.066667  0.075000  0.008333  0.258333   
cg02050847  0.175000  0.025000  0.050000  0.033333  0.058333  0.241667   
cg02233190  0.225000  0.033333  0.133333  0.008333  0.033333  0.316667
...

from: https://stackoverflow.com/a/52363890/9439097
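For readers who want to try this without the full dataset, here is a minimal sketch of the same call on a toy DataFrame. The two short sequences and the index labels `probe_1` and `probe_2` are made up for illustration. The final `join` is an optional extra step, not part of the answer above, that keeps the original column next to the new ones.

```python
import pandas as pd


def freqcount(s):
    # Same logic as the function in the question: drop the "[CG]" marker,
    # then compute single-base and two-base frequencies over the total length.
    bases = "".join(s.split("[CG]"))
    total = len(bases)
    outdic = {}
    for b1 in ["A", "G", "C", "T"]:
        outdic[b1] = bases.count(b1) / total
        for b2 in ["A", "G", "C", "T"]:
            outdic[b1 + b2] = bases.count(b1 + b2) / total
    return outdic


# Hypothetical toy data standing in for df_dna (the real frame has 485512 rows).
df_dna = pd.DataFrame(
    {"Forward_Sequence": ["AAGG[CG]CCTT", "ACGT[CG]ACGT"]},
    index=["probe_1", "probe_2"],
)

# result_type="expand" turns each returned dict into one row of columns.
freqs = df_dna.apply(
    lambda row: freqcount(row["Forward_Sequence"]),
    axis=1,
    result_type="expand",
)

# Optional: attach the new columns to the original frame (aligned on the index).
result = df_dna.join(freqs)
print(result.iloc[:, :4])
```

The column order follows the insertion order of the dictionary keys, so the columns appear as A, AA, AG, AC, AT, G, and so on.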

score 0 · Answer 2 · Quang Hoang · 2021-10-13T14:00:21.903

Try pd.DataFrame:

df_dna.join(pd.DataFrame(df_dna['Forward_Sequence'].apply(freqcount).to_list(), index=df_dna.index))
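The one-liner above can be unpacked into separate steps. This is only a sketch: `base_freqs` is a simplified, hypothetical stand-in for the question's `freqcount` (single bases only), and the toy sequences and index labels are invented for illustration.

```python
import pandas as pd


def base_freqs(s):
    # Hypothetical, simplified stand-in for freqcount: single-base frequencies only.
    bases = s.replace("[CG]", "")
    return {b: bases.count(b) / len(bases) for b in "AGCT"}


# Toy data standing in for df_dna.
df_dna = pd.DataFrame(
    {"Forward_Sequence": ["AAGG[CG]CCTT", "AAAT[CG]GGGC"]},
    index=["probe_1", "probe_2"],
)

# Step 1: Series.apply yields a Series of dicts; to_list() makes it a plain list.
dicts = df_dna["Forward_Sequence"].apply(base_freqs).to_list()

# Step 2: a list of dicts becomes a DataFrame with one column per key.
# Passing index= restores the original row labels so the join can align on them.
freqs = pd.DataFrame(dicts, index=df_dna.index)

# Step 3: join the new columns onto the original frame.
result = df_dna.join(freqs)
print(result)
```

Passing the original index matters here: without it the new frame would get a default 0..n-1 index and the join would not line up with the probe labels.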

edited Oct 13 '21 at 14:00

answered Oct 13 '21 at 13:54

This did not really give the expected result in my case. I found the 'correct way to do it' anyway in another answer, but thank you for your help! – charelf Oct 13 '21 at 13:58

Use `to_list()` instead of `values`. Your answer works fine as well. – Quang Hoang Oct 13 '21 at 14:00