0

I am working with a text file containing a table I need to extract specific columns and create a single row (making Sample ID index, and columns should be Gene ID with the values extracted from Nucleotide Change). The code I am using works if there is only on value (Nucleotide Change) per column (Gene ID, after pivot) but does not work wherecolumns would contain multiple values. How should I approach this? Is pivot not an option? I get the error: ValueError: Index contains duplicate entries, cannot reshape. Here is the table from the text file:

Sample ID CHROM POS REF ALT Read Depth Percent Alt Allele Annotation Variant Type Nucleotide Change Position within CDS Amino acid Change REF Amino acid ALT Amino acid Codon Position Gene Name Gene ID
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 7362 G C 114 98.8 Non-synonymous SNP c.61G>C 61 p.Glu21Gln Glu Gln 21 gyrA Rv0006
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1474639 G A 109 98.9 rRNA SNP c.982G>A NA NA NA NA NA rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1476311 GTG CCA 100 11.7 rRNA MNP c.2654_2655delGTGinsCCA NA NA NA NA NA rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1476369 C T 110 10 rRNA SNP c.2712C>T NA NA NA NA NA rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1476463 C T 84 10.7 rRNA SNP c.2806C>T NA NA NA NA NA rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1476481 T C 100 13 rRNA SNP c.2824T>C NA NA NA NA NA rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1476506 T C 92 12.7 rRNA SNP c.2849T>C NA NA NA NA NA rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1673425 C T 90 98.6 Non-Coding SNP c.-15C>T NA NA NA NA NA fabG1 upstream Rv1483 upstream
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 1917972 A G 62 98 Synonymous SNP c.33A>G 33 p.Leu11Leu Leu Leu 11 tlyA Rv1694
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 2154724 C A 70 98.2 Non-synonymous SNP c.1388G>T 1388 p.Arg463Leu Arg Leu 463 katG Rv1908c
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 4247578 G A 101 98.8 Synonymous SNP c.1065G>A 1065 p.Leu355Leu Leu Leu 355 embB Rv3795
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 NC_000962 4247646 A C 110 98.8 Non-synonymous SNP c.1133A>C 1133 p.Glu378Ala Glu Ala 378 embB Rv3795

and here is the code (written to perform task with a directory of files, for now testing it with only one file):

import os
import pandas as pd
from IPython.display import display

work_dir = ('/data/TB/cdc_DR_annotations/')
os.chdir(work_dir)
# initialize an empty dict to hold filetypes, sample IDs, and dataframes
dfs = {
    'target_region_coverage': dict(),
    'loci_Final_annotation': dict()
}
for root, subdirs, files in os.walk(os.getcwd()):
    for file in files:
        for k, d in dfs.items():
            if k in file:
               
                filepath = os.path.join(root, file)
                df = pd.read_csv(filepath, sep='\t',)# index_col=0)
              
                sample_id = file.split('-')[0]
                dfs[k][sample_id] = df

dfs.keys()
# dfs['target_region_coverage'].keys()
dfs['loci_Final_annotation'].keys()               

print(dfs.keys())
test_sample_id = '18RF0375'
test_df = dfs['loci_Final_annotation'][test_sample_id]
test_df.sample(2)

test_df.pivot(
    columns=['Gene Name'], index=['Sample ID'], values=['Nucleotide Change']
)
natural_d
  • 15
  • 5
  • 1
    What's your expected result? – taller_ExcelHome Aug 31 '23 at 00:19
  • From the docs, _This function does not support data aggregation_. What do you want your aggregate function to be if there are multiple values? Min, max, mean, something else? – Reinderien Aug 31 '23 at 00:22
  • a single row for the sample, a column for each gene ID, and the nucleotide changes listed per gene ID(some columns would have multiple values (nucleotide changes) – natural_d Aug 31 '23 at 00:23
  • Multiple values lumped into one cell (e.g. using a list) is generally a bad idea in Pandas. What is the purpose of doing this? – Reinderien Aug 31 '23 at 00:24
  • If this is just for presentation, are you actually looking for cells with string representations of multiple nucleotides, separated with `, `? – Reinderien Aug 31 '23 at 00:34
  • yes that is what I am looking for. it is just for presentation (I will be doing this with a lot of files, one file per sample, and making one large table. If that makes sense). – natural_d Aug 31 '23 at 01:40
  • Use `pivot_table` with `aggfunc=','.join` – mozway Aug 31 '23 at 02:41
  • do you mean, replace .pivot with .pivot_table and aggfunc=','.join goes within the ( ) along with columns, index, and values? – natural_d Aug 31 '23 at 02:50

1 Answers1

0

Here's a wild guess:

from io import StringIO

import pandas as pd

data = '''
Sample ID   CHROM   POS REF ALT Read Depth  Percent Alt Allele  Annotation  Variant Type    Nucleotide Change   Position within CDS Amino acid Change   REF Amino acid  ALT Amino acid  Codon Position  Gene Name   Gene ID
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   7362    G   C   114 98.8    Non-synonymous  SNP c.61G>C 61  p.Glu21Gln  Glu Gln 21  gyrA    Rv0006
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1474639 G   A   109 98.9    rRNA    SNP c.982G>A    NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476311 GTG CCA 100 11.7    rRNA    MNP c.2654_2655delGTGinsCCA NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476369 C   T   110 10  rRNA    SNP c.2712C>T   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476463 C   T   84  10.7    rRNA    SNP c.2806C>T   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476481 T   C   100 13  rRNA    SNP c.2824T>C   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1476506 T   C   92  12.7    rRNA    SNP c.2849T>C   NA  NA  NA  NA  NA  rrl MTB000020
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1673425 C   T   90  98.6    Non-Coding  SNP c.-15C>T    NA  NA  NA  NA  NA  fabG1 upstream  Rv1483 upstream
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   1917972 A   G   62  98  Synonymous  SNP c.33A>G 33  p.Leu11Leu  Leu Leu 11  tlyA    Rv1694
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   2154724 C   A   70  98.2    Non-synonymous  SNP c.1388G>T   1388    p.Arg463Leu Arg Leu 463 katG    Rv1908c
18RF0375-MI-TBWGS-NextSeq-2019-63_S54   NC_000962   4247578 G   A   101 98.8    Synonymous  SNP c.1065G>A   1065    p.Leu355Leu Leu Leu 355 embB    Rv3795
18RF0375-MI-TBWGS-NextSeq-2020-63_S54   NC_000962   4247646 A   C   110 98.8    Non-synonymous  SNP c.1133A>C   1133    p.Glu378Ala Glu Ala 378 embB    Rv3795
'''

with StringIO(data) as f:
    df = pd.read_csv(f, sep='\t', skipinitialspace=True)
df.columns = df.columns.str.strip()

changes = df[[
    'Sample ID', 'Gene Name', 'Nucleotide Change',
]].set_index(
    ['Sample ID', 'Gene Name']
)['Nucleotide Change']

descriptions = changes.groupby(level=[0,1]).agg(', '.join).unstack(level='Gene Name')
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(descriptions)
Gene Name                                   embB fabG1 upstream     gyrA       katG                                                rrl     tlyA
Sample ID                                                                                                                                      
18RF0375-MI-TBWGS-NextSeq-2019-63_S54  c.1065G>A       c.-15C>T  c.61G>C  c.1388G>T  c.982G>A, c.2654_2655delGTGinsCCA, c.2712C>T, ...  c.33A>G
18RF0375-MI-TBWGS-NextSeq-2020-63_S54  c.1133A>C            NaN      NaN        NaN                                                NaN      NaN
Reinderien
  • 11,755
  • 5
  • 49
  • 77