split string and add to new row

Question

i have a list df that looks like this:

>df
gene_id GO
LOC_Os02g43120.1    GO:0008270 GO:0005515
LOC_Os12g21850.1    GO:0003700 GO:0006355 GO:0005515 GO:0034645 GO:0043565
LOC_Os06g30330.1    GO:0005488
LOC_Os07g37690.1    GO:0016758 GO:0008152

Im not sure how to put it in the right word but i would like all the gene_ids with more than one GO to be separated into a different row.

Expected output:

GO:0008270  LOC_Os02g43120.1
GO:0005515  LOC_Os02g43120.1
GO:0003700  LOC_Os12g21850.1
GO:0006355  LOC_Os12g21850.1
GO:0005515  LOC_Os12g21850.1
GO:0034645  LOC_Os12g21850.1
GO:0043565  LOC_Os12g21850.1
GO:0005488  LOC_Os06g30330.1
GO:0016758  LOC_Os07g37690.1
GO:0008152  LOC_Os07g37690.1

Any help will be appreciated. Thank you

Jason Baker · Accepted Answer · 2023-08-15T04:44:03.827

Using chained operations .split(), .assign(), .explode():

Data:

import pandas as pd

data = {
    "gene_id": ["LOC_Os02g43120.1", "LOC_Os12g21850.1", "LOC_Os06g30330.1", "LOC_Os07g37690.1"],
    "GO": ["GO:0008270 GO:0005515", "GO:0003700 GO:0006355 GO:0005515 GO:0034645 GO:0043565", "GO:0005488",
           "GO:0016758 GO:0008152"]
}

Code:

df = (pd
      .DataFrame(data=data)
      .assign(GO=lambda x: x.GO.str.split(pat=" "))
      .explode(column="GO")
      .reset_index(drop=True)
      )

print(df)

Output:

            gene_id          GO
0  LOC_Os02g43120.1  GO:0008270
1  LOC_Os02g43120.1  GO:0005515
2  LOC_Os12g21850.1  GO:0003700
3  LOC_Os12g21850.1  GO:0006355
4  LOC_Os12g21850.1  GO:0005515
5  LOC_Os12g21850.1  GO:0034645
6  LOC_Os12g21850.1  GO:0043565
7  LOC_Os06g30330.1  GO:0005488
8  LOC_Os07g37690.1  GO:0016758
9  LOC_Os07g37690.1  GO:0008152

`df.assign(GO=df['GO'].str.split()).explode('GO', ignore_index=True)[df.columns[::-1]]` — Corralien, Aug 15 '23 at 04:54

split string and add to new row

1 Answers1