Pandas pivot to create table but columns have multiple values

Question

I am working with a text file containing a table I need to extract specific columns and create a single row (making Sample ID index, and columns should be Gene ID with the values extracted from Nucleotide Change). The code I am using works if there is only on value (Nucleotide Change) per column (Gene ID, after pivot) but does not work wherecolumns would contain multiple values. How should I approach this? Is pivot not an option? I get the error: ValueError: Index contains duplicate entries, cannot reshape. Here is the table from the text file:

Sample ID	CHROM	POS	REF	ALT	Read Depth	Percent Alt Allele	Annotation	Variant Type	Nucleotide Change	Position within CDS	Amino acid Change	REF Amino acid	ALT Amino acid	Codon Position	Gene Name	Gene ID
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	7362	G	C	114	98.8	Non-synonymous	SNP	c.61G>C	61	p.Glu21Gln	Glu	Gln	21	gyrA	Rv0006
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1474639	G	A	109	98.9	rRNA	SNP	c.982G>A	NA	NA	NA	NA	NA	rrl	MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1476311	GTG	CCA	100	11.7	rRNA	MNP	c.2654_2655delGTGinsCCA	NA	NA	NA	NA	NA	rrl	MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1476369	C	T	110	10	rRNA	SNP	c.2712C>T	NA	NA	NA	NA	NA	rrl	MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1476463	C	T	84	10.7	rRNA	SNP	c.2806C>T	NA	NA	NA	NA	NA	rrl	MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1476481	T	C	100	13	rRNA	SNP	c.2824T>C	NA	NA	NA	NA	NA	rrl	MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1476506	T	C	92	12.7	rRNA	SNP	c.2849T>C	NA	NA	NA	NA	NA	rrl	MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1673425	C	T	90	98.6	Non-Coding	SNP	c.-15C>T	NA	NA	NA	NA	NA	fabG1 upstream	Rv1483 upstream
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	1917972	A	G	62	98	Synonymous	SNP	c.33A>G	33	p.Leu11Leu	Leu	Leu	11	tlyA	Rv1694
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	2154724	C	A	70	98.2	Non-synonymous	SNP	c.1388G>T	1388	p.Arg463Leu	Arg	Leu	463	katG	Rv1908c
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	4247578	G	A	101	98.8	Synonymous	SNP	c.1065G>A	1065	p.Leu355Leu	Leu	Leu	355	embB	Rv3795
18RF0375-MI-TBWGS-NextSeq-2019-63_S54	NC_000962	4247646	A	C	110	98.8	Non-synonymous	SNP	c.1133A>C	1133	p.Glu378Ala	Glu	Ala	378	embB	Rv3795

and here is the code (written to perform task with a directory of files, for now testing it with only one file):

import os
import pandas as pd
from IPython.display import display

work_dir = ('/data/TB/cdc_DR_annotations/')
os.chdir(work_dir)
# initialize an empty dict to hold filetypes, sample IDs, and dataframes
dfs = {
    'target_region_coverage': dict(),
    'loci_Final_annotation': dict()
}
for root, subdirs, files in os.walk(os.getcwd()):
    for file in files:
        for k, d in dfs.items():
            if k in file:
               
                filepath = os.path.join(root, file)
                df = pd.read_csv(filepath, sep='\t',)# index_col=0)
              
                sample_id = file.split('-')[0]
                dfs[k][sample_id] = df

dfs.keys()
# dfs['target_region_coverage'].keys()
dfs['loci_Final_annotation'].keys()               

print(dfs.keys())
test_sample_id = '18RF0375'
test_df = dfs['loci_Final_annotation'][test_sample_id]
test_df.sample(2)

test_df.pivot(
    columns=['Gene Name'], index=['Sample ID'], values=['Nucleotide Change']
)

From the docs, _This function does not support data aggregation_. What do you want your aggregate function to be if there are multiple values? Min, max, mean, something else? — Reinderien, Aug 31 '23 at 00:22
a single row for the sample, a column for each gene ID, and the nucleotide changes listed per gene ID(some columns would have multiple values (nucleotide changes) — natural_d, Aug 31 '23 at 00:23
Multiple values lumped into one cell (e.g. using a list) is generally a bad idea in Pandas. What is the purpose of doing this? — Reinderien, Aug 31 '23 at 00:24
If this is just for presentation, are you actually looking for cells with string representations of multiple nucleotides, separated with `, `? — Reinderien, Aug 31 '23 at 00:34
yes that is what I am looking for. it is just for presentation (I will be doing this with a lot of files, one file per sample, and making one large table. If that makes sense). — natural_d, Aug 31 '23 at 01:40
do you mean, replace .pivot with .pivot_table and aggfunc=','.join goes within the ( ) along with columns, index, and values? — natural_d, Aug 31 '23 at 02:50

score 0 · Answer 1 · answered Aug 31 '23 at 00:46

Here's a wild guess:

from io import StringIO

import pandas as pd

data = '''
Sample ID   CHROM   POS REF ALT Read Depth  Percent Alt Allele  Annotation  Variant Type    Nucleotide Change   Position within CDS Amino acid Change   REF Amino acid  ALT Amino acid  Codon Position  Gene Name   Gene ID
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   7362    G   C   114 98.8    Non-synonymous  SNP c.61G>C 61  p.Glu21Gln  Glu Gln 21  gyrA    Rv0006
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1474639 G   A   109 98.9    rRNA    SNP c.982G>A    NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476311 GTG CCA 100 11.7    rRNA    MNP c.2654_2655delGTGinsCCA NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476369 C   T   110 10  rRNA    SNP c.2712C>T   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476463 C   T   84  10.7    rRNA    SNP c.2806C>T   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476481 T   C   100 13  rRNA    SNP c.2824T>C   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476506 T   C   92  12.7    rRNA    SNP c.2849T>C   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1673425 C   T   90  98.6    Non-Coding  SNP c.-15C>T    NA  NA  NA  NA  NA  fabG1 upstream  Rv1483 upstream
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1917972 A   G   62  98  Synonymous  SNP c.33A>G 33  p.Leu11Leu  Leu Leu 11  tlyA    Rv1694
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   2154724 C   A   70  98.2    Non-synonymous  SNP c.1388G>T   1388    p.Arg463Leu Arg Leu 463 katG    Rv1908c
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   4247578 G   A   101 98.8    Synonymous  SNP c.1065G>A   1065    p.Leu355Leu Leu Leu 355 embB    Rv3795
18RF0375-MI-TBWGS-NextSeq-2020-63_S54   NC_000962   4247646 A   C   110 98.8    Non-synonymous  SNP c.1133A>C   1133    p.Glu378Ala Glu Ala 378 embB    Rv3795
'''

with StringIO(data) as f:
    df = pd.read_csv(f, sep='\t', skipinitialspace=True)
df.columns = df.columns.str.strip()

changes = df[[
    'Sample ID', 'Gene Name', 'Nucleotide Change',
]].set_index(
    ['Sample ID', 'Gene Name']
)['Nucleotide Change']

descriptions = changes.groupby(level=[0,1]).agg(', '.join).unstack(level='Gene Name')
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(descriptions)

Gene Name                                   embB fabG1 upstream     gyrA       katG                                                rrl     tlyA
Sample ID                                                                                                                                      
18RF0375-MI-TBWGS-NextSeq-2019-63_S54  c.1065G>A       c.-15C>T  c.61G>C  c.1388G>T  c.982G>A, c.2654_2655delGTGinsCCA, c.2712C>T, ...  c.33A>G
18RF0375-MI-TBWGS-NextSeq-2020-63_S54  c.1133A>C            NaN      NaN        NaN                                                NaN      NaN

Pandas pivot to create table but columns have multiple values

1 Answers1