I am working with a text file containing a table I need to extract specific columns and create a single row (making Sample ID index, and columns should be Gene ID with the values extracted from Nucleotide Change). The code I am using works if there is only on value (Nucleotide Change) per column (Gene ID, after pivot) but does not work wherecolumns would contain multiple values. How should I approach this? Is pivot not an option? I get the error: ValueError: Index contains duplicate entries, cannot reshape. Here is the table from the text file:
Sample ID | CHROM | POS | REF | ALT | Read Depth | Percent Alt Allele | Annotation | Variant Type | Nucleotide Change | Position within CDS | Amino acid Change | REF Amino acid | ALT Amino acid | Codon Position | Gene Name | Gene ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 7362 | G | C | 114 | 98.8 | Non-synonymous | SNP | c.61G>C | 61 | p.Glu21Gln | Glu | Gln | 21 | gyrA | Rv0006 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1474639 | G | A | 109 | 98.9 | rRNA | SNP | c.982G>A | NA | NA | NA | NA | NA | rrl | MTB000020 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1476311 | GTG | CCA | 100 | 11.7 | rRNA | MNP | c.2654_2655delGTGinsCCA | NA | NA | NA | NA | NA | rrl | MTB000020 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1476369 | C | T | 110 | 10 | rRNA | SNP | c.2712C>T | NA | NA | NA | NA | NA | rrl | MTB000020 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1476463 | C | T | 84 | 10.7 | rRNA | SNP | c.2806C>T | NA | NA | NA | NA | NA | rrl | MTB000020 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1476481 | T | C | 100 | 13 | rRNA | SNP | c.2824T>C | NA | NA | NA | NA | NA | rrl | MTB000020 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1476506 | T | C | 92 | 12.7 | rRNA | SNP | c.2849T>C | NA | NA | NA | NA | NA | rrl | MTB000020 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1673425 | C | T | 90 | 98.6 | Non-Coding | SNP | c.-15C>T | NA | NA | NA | NA | NA | fabG1 upstream | Rv1483 upstream |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 1917972 | A | G | 62 | 98 | Synonymous | SNP | c.33A>G | 33 | p.Leu11Leu | Leu | Leu | 11 | tlyA | Rv1694 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 2154724 | C | A | 70 | 98.2 | Non-synonymous | SNP | c.1388G>T | 1388 | p.Arg463Leu | Arg | Leu | 463 | katG | Rv1908c |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 4247578 | G | A | 101 | 98.8 | Synonymous | SNP | c.1065G>A | 1065 | p.Leu355Leu | Leu | Leu | 355 | embB | Rv3795 |
18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | NC_000962 | 4247646 | A | C | 110 | 98.8 | Non-synonymous | SNP | c.1133A>C | 1133 | p.Glu378Ala | Glu | Ala | 378 | embB | Rv3795 |
and here is the code (written to perform task with a directory of files, for now testing it with only one file):
import os
import pandas as pd
from IPython.display import display
work_dir = ('/data/TB/cdc_DR_annotations/')
os.chdir(work_dir)
# initialize an empty dict to hold filetypes, sample IDs, and dataframes
dfs = {
'target_region_coverage': dict(),
'loci_Final_annotation': dict()
}
for root, subdirs, files in os.walk(os.getcwd()):
for file in files:
for k, d in dfs.items():
if k in file:
filepath = os.path.join(root, file)
df = pd.read_csv(filepath, sep='\t',)# index_col=0)
sample_id = file.split('-')[0]
dfs[k][sample_id] = df
dfs.keys()
# dfs['target_region_coverage'].keys()
dfs['loci_Final_annotation'].keys()
print(dfs.keys())
test_sample_id = '18RF0375'
test_df = dfs['loci_Final_annotation'][test_sample_id]
test_df.sample(2)
test_df.pivot(
columns=['Gene Name'], index=['Sample ID'], values=['Nucleotide Change']
)