0

everyone. This problem has already been asked by others. Splitting dictionary/list inside a Pandas Column into Separate Columns

I have already asked this question. But it doesn't be resolved. How to use pandas to build a column which are in a dataframe

Now, I have a dataframe. It looks like this.

                     intron_id                                            octamer
0       >ENSG00000183943.1  AGCCATGC:1 AGUAGCUG:1 GCCUGGCC:1 AGAUGAUG:1 AG...
1       >ENSG00000183943.2  CATATTTC:1 UCCCAAAA:1 AAGCCATA:1 TATTTTGC:1 TA...
2       >ENSG00000183943.3  AGUAGCUG:4 UCAACAGG:1 CCUUUCAU:1 UACCUUUU:1 GC...
3       >ENSG00000183943.4  AUGAGCAC:1 UCCUACGG:1 GGAGGATC:1 AUAGGGUG:1 CC...
4       >ENSG00000183943.5  UUGCCAAU:1 AUGCUGGG:1 ACUAUUUU:1 GGAGGATC:3 UG...

Now, I want to transform it as this.

    intron_id   AGCCATGA AGUAGCUG  GCCUGGCC ......
>ENSG00000183943.1  1      1         1
>ENSG00000183943.2  0      0        0
>ENSG00000183943.3   0      0         0

But when I tried to use apply(pd.Series) or df.octamer.values.tolist() , both of them don't work. I am confused. Hope you can give me some advices. Thank you in advance. My code is as follows.

    import pandas as pd

df=pd.read_csv('~/10genomic/elife/octamer/intron_seq/count.txt',delimiter='\t',header=None)
df.rename(columns={0:"intron_id",1:"octamer"},inplace=True)
df['octamer']=df['octamer'].apply(lambda x:str(x))
print(df)


                 intron_id                                            octamer
0       >ENSG00000183943.1  AGCCATGC:1 AGUAGCUG:1 GCCUGGCC:1 AGAUGAUG:1 AG...
1       >ENSG00000183943.2  CATATTTC:1 UCCCAAAA:1 AAGCCATA:1 TATTTTGC:1 TA...
2       >ENSG00000183943.3  AGUAGCUG:4 UCAACAGG:1 CCUUUCAU:1 UACCUUUU:1 GC...
3       >ENSG00000183943.4  AUGAGCAC:1 UCCUACGG:1 GGAGGATC:1 AUAGGGUG:1 CC...
4       >ENSG00000183943.5  UUGCCAAU:1 AUGCUGGG:1 ACUAUUUU:1 GGAGGATC:3 UG...

df.drop(labels=[2370,3967,5728,11875,14464],axis=0,inplace=True)


def builddict(x):
    dictls=[]
    for item in x.split(" "):
        dictls.append(item.split(":"))
    return(dict(dictls))

df['octamer']=df['octamer'].apply(builddict)
print(df)


                intron_id                                            octamer
0       >ENSG00000183943.1  {'AGCCATGC': '1', 'AGUAGCUG': '1', 'GCCUGGCC':...
1       >ENSG00000183943.2  {'CATATTTC': '1', 'UCCCAAAA': '1', 'AAGCCATA':...
2       >ENSG00000183943.3  {'AGUAGCUG': '4', 'UCAACAGG': '1', 'CCUUUCAU':...
3       >ENSG00000183943.4  {'AUGAGCAC': '1', 'UCCUACGG': '1', 'GGAGGATC':...
4       >ENSG00000183943.5  {'UUGCCAAU': '1', 'AUGCUGGG': '1', 'ACUAUUUU':...

print(df['octamer'].apply(pd.Series))


                                                      0
0      {'AGCCATGC': '1', 'AGUAGCUG': '1', 'GCCUGGCC':...
1      {'CATATTTC': '1', 'UCCCAAAA': '1', 'AAGCCATA':...
2      {'AGUAGCUG': '4', 'UCAACAGG': '1', 'CCUUUCAU':...
3      {'AUGAGCAC': '1', 'UCCUACGG': '1', 'GGAGGATC':...
4      {'UUGCCAAU': '1', 'AUGCUGGG': '1', 'ACUAUUUU':...

When I tried to solve it as follow, it produced this wrong. I really confuesd.

    df=pd.read_csv('~/10genomic/elife/octamer/intron_seq/countdict.txt',delimiter=',',index_col=0)
df=df.iloc[:3,:]
print(df)
            intron_id                                            octamer
0  >ENSG00000183943.1  {'AGCCATGC': '1', 'AGUAGCUG': '1', 'GCCUGGCC':...
1  >ENSG00000183943.2  {'CATATTTC': '1', 'UCCCAAAA': '1', 'AAGCCATA':...
2  >ENSG00000183943.3  {'AGUAGCUG': '4', 'UCAACAGG': '1', 'CCUUUCAU':...

temp_df=pd.DataFrame.from_records(df.pop("octamer"))
print(temp_df)
0     1     2     3     4     5      ... 73895 73896 73897 73898 73899 73900
0     {     '     A     G     C     C  ...  None  None  None  None  None  None
1     {     '     C     A     T     A  ...  None  None  None  None  None  None
2     {     '     A     G     U     A  ...     :           '     1     '     }
ruiyan hou
  • 11
  • 4

2 Answers2

0

What you might try to do is loop through both columns and create separate dictionaries. This solution isn't the most efficient, but I think it will get the job done.

I would create a dictionary of lists of all the keys:

dict = {'intron_id':[], 'AGCCATGC':[], etc..}

Then loop through each row of the df and also loop through the dictionary:

for index, row in df.iterrows():
    dict['intron_id'].append(row['intron_id'])
    for key, value in row['octamer'].iteritems():
        dict[key].append(value)

Then recreate the dataframe with the dictionaries. Pandas DataFrame from Dictionary with Lists

That should do the job. There might be some syntax issues, but hopefully that'll help you out.

Om Chabra
  • 35
  • 4
  • Ok, thank you. I would try your solution. Thank you for your help! – ruiyan hou Sep 06 '20 at 01:10
  • @ruiyanhou Okay, let me know how it goes. I am curious to see if this works. If it doesn't, I am happy to help you out. If it does work, could you mark it as correct? Thanks a lot! – Om Chabra Sep 06 '20 at 08:24
0

First some fake data:

import pandas as pd


df = pd.DataFrame(data={
    "intron_id": ["A", "B"],
    "octamer": [{'AGCCATGC': '1', 'AGUAGCUG': '1'}, {'CATATTTC': '1', 'UCCCAAAA': '1'}],
})

print(df)

enter image description here

Then extract the column containing the dicts and pass it to from_records to create our new dataframe

temp_df = pd.DataFrame.from_records(df.pop("octamer"))
print(temp_df)

enter image description here

Finally merge our original dataframe with our new one, using indices as the keys

df = df.merge(temp_df, left_index=True, right_index=True)
print(df)

Et voilà! (at least I think this is what you wanted anyway) enter image description here

Edit
I get similarly garbled results if my values are really just strings that look like dictionaries. You can confirm this by extracting a single value from your column and checking its type:

df = pd.read_csv('~/10genomic/elife/octamer/intron_seq/countdict.txt',delimiter=',',index_col=0)
df = df.iloc[:3,:]
first_octamer = df["octamer"][0]
print(type(first_octamer))

I suspect you will see <class 'str'> as the result. If so, we can use ast.literal_eval to try to parse the strings as actual dictionaries:

temp_df = df.pop("octamer")
temp_df = temp_df.apply(ast.literal_eval)
temp_df = pd.DataFrame.from_records(temp_df)

df = df.merge(temp_df, left_index=True, right_index=True)
print(df)

I hope your strings are well-formatted though, else things might get tricky...

Edunne
  • 224
  • 1
  • 6
  • Thank you for your help. But when I tried your method. It doesn't work on my data. My process is shown in the last answer. I don't know where the problem is. Hope to get your help! – ruiyan hou Sep 06 '20 at 02:53
  • thanks for your help! But when I tried your method, it does not work for my data. Hope to get your answer! The process is in the last of my question! – ruiyan hou Sep 06 '20 at 03:02