Hi everyone. A similar problem has already been asked by others: "Splitting dictionary/list inside a Pandas Column into Separate Columns".
I have also asked this question before, but it was not resolved: "How to use pandas to build a column which are in a dataframe".
I have a dataframe that looks like this:
intron_id octamer
0 >ENSG00000183943.1 AGCCATGC:1 AGUAGCUG:1 GCCUGGCC:1 AGAUGAUG:1 AG...
1 >ENSG00000183943.2 CATATTTC:1 UCCCAAAA:1 AAGCCATA:1 TATTTTGC:1 TA...
2 >ENSG00000183943.3 AGUAGCUG:4 UCAACAGG:1 CCUUUCAU:1 UACCUUUU:1 GC...
3 >ENSG00000183943.4 AUGAGCAC:1 UCCUACGG:1 GGAGGATC:1 AUAGGGUG:1 CC...
4 >ENSG00000183943.5 UUGCCAAU:1 AUGCUGGG:1 ACUAUUUU:1 GGAGGATC:3 UG...
I want to transform it into this:
intron_id AGCCATGC AGUAGCUG GCCUGGCC ......
>ENSG00000183943.1 1 1 1
>ENSG00000183943.2 0 0 0
>ENSG00000183943.3 0 0 0
But when I tried apply(pd.Series) or df.octamer.values.tolist(), neither of them worked. I am confused and would appreciate any advice. Thank you in advance. My code is as follows:
import pandas as pd
df=pd.read_csv('~/10genomic/elife/octamer/intron_seq/count.txt',delimiter='\t',header=None)
df.rename(columns={0:"intron_id",1:"octamer"},inplace=True)
df['octamer']=df['octamer'].apply(lambda x:str(x))
print(df)
intron_id octamer
0 >ENSG00000183943.1 AGCCATGC:1 AGUAGCUG:1 GCCUGGCC:1 AGAUGAUG:1 AG...
1 >ENSG00000183943.2 CATATTTC:1 UCCCAAAA:1 AAGCCATA:1 TATTTTGC:1 TA...
2 >ENSG00000183943.3 AGUAGCUG:4 UCAACAGG:1 CCUUUCAU:1 UACCUUUU:1 GC...
3 >ENSG00000183943.4 AUGAGCAC:1 UCCUACGG:1 GGAGGATC:1 AUAGGGUG:1 CC...
4 >ENSG00000183943.5 UUGCCAAU:1 AUGCUGGG:1 ACUAUUUU:1 GGAGGATC:3 UG...
df.drop(labels=[2370,3967,5728,11875,14464],axis=0,inplace=True)
def builddict(x):
    # Turn "KMER:count KMER:count ..." into {"KMER": "count", ...}
    dictls=[]
    for item in x.split(" "):
        dictls.append(item.split(":"))
    return dict(dictls)
df['octamer']=df['octamer'].apply(builddict)
print(df)
intron_id octamer
0 >ENSG00000183943.1 {'AGCCATGC': '1', 'AGUAGCUG': '1', 'GCCUGGCC':...
1 >ENSG00000183943.2 {'CATATTTC': '1', 'UCCCAAAA': '1', 'AAGCCATA':...
2 >ENSG00000183943.3 {'AGUAGCUG': '4', 'UCAACAGG': '1', 'CCUUUCAU':...
3 >ENSG00000183943.4 {'AUGAGCAC': '1', 'UCCUACGG': '1', 'GGAGGATC':...
4 >ENSG00000183943.5 {'UUGCCAAU': '1', 'AUGCUGGG': '1', 'ACUAUUUU':...
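Once every cell really holds a dict, one possible way to reach the wide table is to pass the list of dicts to the DataFrame constructor. The following is only a minimal sketch under assumptions: the two rows are toy data standing in for count.txt, and `build_counts` and `wide` are hypothetical names (a variant of builddict that also converts the counts to int), not part of the code above.

```python
import pandas as pd

# Toy rows standing in for the real file (assumption: same "KMER:count" layout).
df = pd.DataFrame({
    "intron_id": [">ENSG00000183943.1", ">ENSG00000183943.3"],
    "octamer": ["AGCCATGC:1 AGUAGCUG:1 GCCUGGCC:1", "AGUAGCUG:4 UCAACAGG:1"],
})

def build_counts(cell):
    """Parse 'KMER:count KMER:count ...' into {kmer: int(count)}."""
    pairs = (item.split(":") for item in cell.split())
    return {kmer: int(count) for kmer, count in pairs}

# A plain Python list of dicts; each dict becomes one row.
records = df["octamer"].apply(build_counts).tolist()

# One column per octamer seen in any row, indexed by intron_id.
wide = pd.DataFrame(records, index=pd.Index(df["intron_id"], name="intron_id"))

# Octamers missing from a row come out as NaN; treat them as a count of 0.
wide = wide.fillna(0).astype(int)
print(wide)
```

With many distinct octamers the result can become very wide, so this may need a sparse representation on the full data.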
print(df['octamer'].apply(pd.Series))
0
0 {'AGCCATGC': '1', 'AGUAGCUG': '1', 'GCCUGGCC':...
1 {'CATATTTC': '1', 'UCCCAAAA': '1', 'AAGCCATA':...
2 {'AGUAGCUG': '4', 'UCAACAGG': '1', 'CCUUUCAU':...
3 {'AUGAGCAC': '1', 'UCCUACGG': '1', 'GGAGGATC':...
4 {'UUGCCAAU': '1', 'AUGCUGGG': '1', 'ACUAUUUU':...
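A single column named 0 is what apply(pd.Series) produces when the cells are strings that merely look like dicts. Whether that was the case here is an assumption, since the post does not show the cell types. A small sketch to check it, using made-up values and the hypothetical names `as_dicts` and `as_text`:

```python
import pandas as pd

# Two toy cells as real dicts, and the same cells converted to their text form.
as_dicts = pd.Series([{"AGCCATGC": "1", "AGUAGCUG": "1"}, {"AGUAGCUG": "4"}])
as_text = as_dicts.map(str)

# Inspect what kind of object each column actually holds.
print(as_dicts.map(type).unique())
print(as_text.map(type).unique())

# Real dicts expand into one column per key.
expanded = as_dicts.apply(pd.Series)

# Strings do not expand: each string becomes a one-element Series with label 0.
not_expanded = as_text.apply(pd.Series)

print(expanded.columns.tolist())
print(not_expanded.columns.tolist())
```

Running `df['octamer'].map(type).unique()` on the real dataframe would show which of the two cases applies.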
When I tried to solve it as follows, I got the wrong result shown below. I am really confused.
df=pd.read_csv('~/10genomic/elife/octamer/intron_seq/countdict.txt',delimiter=',',index_col=0)
df=df.iloc[:3,:]
print(df)
intron_id octamer
0 >ENSG00000183943.1 {'AGCCATGC': '1', 'AGUAGCUG': '1', 'GCCUGGCC':...
1 >ENSG00000183943.2 {'CATATTTC': '1', 'UCCCAAAA': '1', 'AAGCCATA':...
2 >ENSG00000183943.3 {'AGUAGCUG': '4', 'UCAACAGG': '1', 'CCUUUCAU':...
temp_df=pd.DataFrame.from_records(df.pop("octamer"))
print(temp_df)
0 1 2 3 4 5 ... 73895 73896 73897 73898 73899 73900
0 { ' A G C C ... None None None None None None
1 { ' C A T A ... None None None None None None
2 { ' A G U A ... : ' 1 ' }
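The one-character-per-column output is consistent with the octamer column holding strings rather than dicts: a dict written to CSV is stored as its text form and read back as plain text, and from_records then iterates over each string character by character. A minimal sketch of one way around this, assuming the file was produced by to_csv from a dict column. The in-memory buffer stands in for countdict.txt, `parse_cell` is a hypothetical helper, and the plain DataFrame constructor is used here in place of from_records:

```python
import ast
import io
import pandas as pd

# Simulate the round trip: a dict column written to CSV and read back.
original = pd.DataFrame({
    "intron_id": [">ENSG00000183943.1", ">ENSG00000183943.3"],
    "octamer": [{"AGCCATGC": "1", "AGUAGCUG": "1"}, {"AGUAGCUG": "4"}],
})
buffer = io.StringIO()
original.to_csv(buffer)                 # each dict is saved as its text form
buffer.seek(0)
df = pd.read_csv(buffer, index_col=0)   # the cells come back as str, not dict

text_cells = df.pop("octamer")
print(type(text_cells.iloc[0]))         # str

def parse_cell(cell):
    """Turn the text form back into a dict and convert the counts to int."""
    return {kmer: int(count) for kmer, count in ast.literal_eval(cell).items()}

# Build the wide table from real dicts, aligned on the original row labels.
temp_df = pd.DataFrame(text_cells.apply(parse_cell).tolist(), index=df.index)
temp_df = temp_df.fillna(0).astype(int)  # missing octamers count as 0

result = df.join(temp_df)
print(result)
```

ast.literal_eval only accepts Python literals, so it is a safer choice than eval for text read from a file.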